Allow to run solr-indexing without an existing collection #289

Open
nichtich opened this issue Jun 28, 2023 · 7 comments
@nichtich
Collaborator

We would like to use qa-catalogue with an existing SolrCloud with limited access (qa-catalogue should not be allowed to create and rename collections). As far as I understand the current solr indexing process involves:

The required use case is instead as follows:

  • we manually set up a solr index at SolrCloud, not running at localhost
  • qa-catalogue only purges the index and re-indexes all data, without changing schemas or collection names

I suppose the following extension to configuration is needed:

  • new configuration variable SOLR to point to a solr collection via URL (e.g. http://example.org:8983/solr/qa-catalogue)
  • new option to solr-scripts (prepare-solr and index) to disable the ${db}_dev collection and any changes to the schema
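
The proposed SOLR variable could be resolved roughly as in the sketch below. It is only an illustration of the idea: the function name resolve_solr_url is hypothetical, and the fallback to a host plus collection name is an assumption based on the localhost setup mentioned in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of qa-catalogue.
# $1: value of the proposed SOLR variable (full collection URL, may be empty)
# $2: Solr host used as a fallback (assumed default: http://localhost:8983)
# $3: collection name used as a fallback
resolve_solr_url() {
  local solr="${1:-}" host="${2:-http://localhost:8983}" name="${3:-}"
  if [[ -n "${solr}" ]]; then
    # An explicit collection URL wins; drop a trailing slash for consistency.
    echo "${solr%/}"
  else
    # No explicit URL: build one from host and collection name.
    echo "${host%/}/solr/${name}"
  fi
}

resolve_solr_url "http://example.org:8983/solr/qa-catalogue"
resolve_solr_url "" "http://localhost:8983" "loc"
```

With an explicit URL the script never has to know (or create) a collection name, which matches the restricted-access use case.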
@nichtich
Collaborator Author

If Solr is also used during analysis for temporary results, there might be two Solr instances: one local for doing current analysis and one with pre-configured collection running at another host.

@pkiraly
Owner

pkiraly commented Aug 29, 2023

estimation: 2 hours

note: this was a rough, and not a realistic estimation. The task is more complex than I first thought.

@pkiraly
Owner

pkiraly commented Oct 20, 2023

This comment is just for recording the current situation:

In the backend of the indexing process there are two Java parameters: solrUrl and validationUrl. Both refer to a Solr index (e.g. http://localhost:8983/solr/loc), not to the Solr service (http://localhost:8983/solr). The index script passes the values of these parameters:

  • --solrUrl ${SOLR_DB_URL} where SOLR_DB_URL="${SOLR_HOST}/solr/${CORE}"
  • --validationUrl ${SOLR_HOST}/solr/${validationCore} where validationCore is a command line parameter of the script, which the common-script sets to ${NAME}_validation if --groupBy is set to 1.
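
As a rough illustration of the two bullets above, the URLs can be assembled like this. The concrete values (localhost, loc) are taken from the examples in this comment; the PARAMS array is an assumption made for the sketch, not the actual variable used by the index script.

```shell
#!/usr/bin/env bash
# Illustrative values only.
SOLR_HOST="http://localhost:8983"
NAME="loc"
CORE="${NAME}"
validationCore="${NAME}_validation"   # set this way only when --groupBy is 1

# Both parameters point at a Solr index, not at the Solr service itself.
SOLR_DB_URL="${SOLR_HOST}/solr/${CORE}"
VALIDATION_URL="${SOLR_HOST}/solr/${validationCore}"

# Hypothetical array holding the arguments handed to the Java backend.
PARAMS=(--solrUrl "${SOLR_DB_URL}" --validationUrl "${VALIDATION_URL}")
printf '%s\n' "${PARAMS[*]}"
```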

Here the index process only reads from the validation index, enriches each bibliographic record with that data, and stores the result in the main index.

As part of the validation process, the calculate-aggregated-numbers.grouped.sh script creates the validation index purely to improve performance. Reading from Solr is faster than reading from Sqlite3, and if the dataset is large (as is the case with K10plus), the difference has an impact on the user interface. This happens only if the --groupBy parameter is set to 1.

In summary:

  • the validation process creates the validation index, which contains the issues found per record
  • the index process merges the validation index into the main index. After that the validation index is no longer needed, and one can search Solr for both bibliographic content and validation issues.
  • this happens only under a specific condition: the --groupBy parameter is set to 1.

In the web interface two new configuration options have been introduced: mainSolrEndpoint and solrEndpoint4ValidationResults. The default for both is http://localhost:8983/solr/.

Conclusion:

  • the parameters could be explicit, with default values as a fallback when no explicit value is set
  • the validation indexing should be triggered by an explicit flag
  • in the long run the index should contain additional information from other analyses, e.g. completeness scores, so the name validation is misleading. We should call it something different, e.g. scores, so the parameter should be something like solrForScoresUrl
  • if this auxiliary index is merged, we can delete it
  • the backend and frontend parameters should have the same name and semantics

Based on this I reset the estimation to 10 hours.

@pkiraly
Owner

pkiraly commented Oct 28, 2023

@nichtich I moved the schema preparation into prepare-solr and the swap of NAME and NAME_dev into a postprocess-solr script, so we have 3 indexing related tasks:

  • prepare-solr: creates the two indices if they do not already exist, and prepares the schema
  • index: purely runs the indexing task
  • postprocess-solr: swaps the two indices.

You have to run only the index command and not all-solr (which triggers all 3 tasks).
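
The relation between the three tasks and all-solr can be pictured with the sketch below. The function bodies are stubs and run_task is a hypothetical dispatcher written for this illustration; it is not the actual qa-catalogue code.

```shell
#!/usr/bin/env bash
# Stub tasks standing in for the real scripts (hypothetical bodies).
prepare_solr()     { echo "prepare-solr"; }
index()            { echo "index"; }
postprocess_solr() { echo "postprocess-solr"; }

# Hypothetical dispatcher: all-solr chains the three tasks,
# while each task can also be run on its own.
run_task() {
  case "$1" in
    prepare-solr)     prepare_solr ;;
    index)            index ;;
    postprocess-solr) postprocess_solr ;;
    all-solr)         prepare_solr && index && postprocess_solr ;;
    *) echo "unknown task: $1" >&2; return 1 ;;
  esac
}

# With a pre-configured collection only the middle step is needed.
run_task index
```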

To experiment with the settings I created a new flag, --onlyIndex. If it is set, the index task will use the main index; otherwise it will use the _dev index:

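  # Count occurrences of --onlyIndex in TYPE_PARAMS
  # ("|| true" keeps a non-match from being treated as a failure)
  ONLY_INDEX=$(echo "${TYPE_PARAMS}" | grep -c -P -e '--onlyIndex' || true)
  if [[ "${ONLY_INDEX}" == "0" ]]; then
    # flag absent: index into the _dev core
    CORE=${NAME}_dev
  else
    # flag present: index directly into the main core
    CORE=${NAME}
  fi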
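
A possible variant of the same check, shown only as a sketch: it compares whole tokens instead of using a substring match, so a longer option that merely starts with --onlyIndex would not trigger it. The function name select_core is hypothetical, and the sketch assumes TYPE_PARAMS is a flat, space-separated option string.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the Solr core to index into.
# $1: collection name (NAME), $2: option string (TYPE_PARAMS)
select_core() {
  local name="$1" params="$2" token
  local -a tokens
  # Split on whitespace without glob expansion.
  read -r -a tokens <<< "${params}"
  for token in "${tokens[@]}"; do
    if [[ "${token}" == "--onlyIndex" ]]; then
      echo "${name}"        # flag present: main index
      return 0
    fi
  done
  echo "${name}_dev"        # flag absent: _dev index
}

select_core "loc" "--marcVersion DNB --onlyIndex"
select_core "loc" "--marcVersion DNB"
```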

If you still want to use two Solr indices and are fine with the NAME_dev index, and you only want to manage the schema and the index swap independently of the script, please let me know, because then this last piece is not needed at all.

Note: the feature is not yet done.

@pkiraly
Owner

pkiraly commented Nov 16, 2023

another TODO:

  • fix calculate-aggregated-numbers.grouped.sh

pkiraly added a commit that referenced this issue Nov 26, 2023
…ng the issue around 'validation' Solr instance
pkiraly added a commit that referenced this issue Nov 26, 2023
…g the issue around 'validation' Solr instance
@pkiraly
Owner

pkiraly commented Nov 26, 2023

@nichtich It now depends on the configuration, and hopefully there are no hidden hardwired variables left in the code. To prevent index creation, use --onlyIndex. To specify the "validation" index, use --solrForScoresUrl [URL of Solr core]; otherwise it will use localhost:8983 as the host name and NAME_validation as the Solr core.
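
The fallback described above can be sketched as follows. This is not the actual implementation: the function name scores_url is hypothetical, and the http scheme and /solr/ path in the default are assumptions based on the URL examples earlier in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the URL of the "validation" (scores) index.
# $1: collection name (NAME), $2: space-separated option string
scores_url() {
  local name="$1" params="$2" i
  local -a tokens
  read -r -a tokens <<< "${params}"
  for ((i = 0; i < ${#tokens[@]}; i++)); do
    # Use the value that follows --solrForScoresUrl, if there is one.
    if [[ "${tokens[i]}" == "--solrForScoresUrl" && -n "${tokens[i+1]:-}" ]]; then
      echo "${tokens[i+1]}"
      return 0
    fi
  done
  # Assumed default: localhost:8983 as host, NAME_validation as core.
  echo "http://localhost:8983/solr/${name}_validation"
}

scores_url "loc" "--onlyIndex --solrForScoresUrl http://example.org:8983/solr/scores"
scores_url "loc" "--onlyIndex"
```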

pkiraly added a commit that referenced this issue Nov 26, 2023
…g the issue around 'validation' Solr instance
@pkiraly
Owner

pkiraly commented Mar 14, 2024

@nichtich Have you had a chance to check it? I have added somewhat more detailed documentation to the README. I think this issue is fixed, but I would like to hear your opinion.

@pkiraly pkiraly moved this to 👀 In review in PICA May 27, 2024