Allow to run solr-indexing without an existing collection #289

nichtich · 2023-06-28T08:17:09Z

We would like to use qa-catalogue with an existing SolrCloud with limited access (qa-catalogue should not be allowed to create or rename collections). As far as I understand, the current Solr indexing process involves:

- prepare-solr, which creates two new Solr collections ($db and ${db}_dev) and adjusts their schema
- index, which deletes the index, prepares the schema, indexes the data into ${name}_dev, and finally optimizes the index and swaps the names

The required use case is instead as follows:

- we manually set up a Solr index on a SolrCloud that is not running at localhost
- qa-catalogue only purges the index and indexes all data, without changing schemas or collection names

I suppose the following extensions to the configuration are needed:

- a new configuration variable SOLR that points to a Solr collection via URL (e.g. http://example.org:8983/solr/qa-catalogue)
- a new option for the Solr scripts (prepare-solr and index) to disable the ${db}_dev collection and any changes to the schema
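A minimal sketch of how a single SOLR variable could be split into the Solr service URL and the collection name. The function names are hypothetical and not part of qa-catalogue; only the example URL comes from the proposal above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: split a collection URL such as
# http://example.org:8983/solr/qa-catalogue into its two parts.

solr_base_url() {
  # Drop an optional trailing slash, then the last path segment
  local url="${1%/}"
  printf '%s\n' "${url%/*}"
}

solr_collection() {
  # Drop an optional trailing slash, then keep only the last path segment
  local url="${1%/}"
  printf '%s\n' "${url##*/}"
}

SOLR="http://example.org:8983/solr/qa-catalogue"
echo "service:    $(solr_base_url "$SOLR")"
echo "collection: $(solr_collection "$SOLR")"
```

With such a split the scripts could keep using a host and a core internally while the user configures only one URL.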


nichtich · 2023-08-16T06:42:33Z

If Solr is also used during analysis for temporary results, there might be two Solr instances: a local one for the current analysis and one with a pre-configured collection running on another host.

pkiraly · 2023-08-29T19:21:52Z

estimation: ~~2 hours~~

note: this was a rough and unrealistic estimate. The task is more complex than I first thought.

pkiraly · 2023-10-20T07:42:07Z

This comment just records the current situation:

In the backend of the indexing process there are two Java parameters, solrUrl and validationUrl. Both refer to a Solr index (e.g. http://localhost:8983/solr/loc), not to the Solr service (http://localhost:8983/solr). The index script passes the values of these parameters:

- --solrUrl ${SOLR_DB_URL}, where SOLR_DB_URL="${SOLR_HOST}/solr/${CORE}"
- --validationUrl ${SOLR_HOST}/solr/${validationCore}, where validationCore is a command line parameter of the script, which the common-script sets to ${NAME}_validation if --groupBy is set to 1.

Here the index process only reads from the validation index, enhances the bibliographic record with it, and then stores the result in the main index.
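The way the two URLs are composed can be sketched as follows. This is an illustration only: build_index_params is a hypothetical helper, and omitting --validationUrl when no validation core is given is an assumption, not confirmed behaviour of the index script.

```shell
#!/usr/bin/env bash
# Illustrative sketch; build_index_params is not a function of the
# qa-catalogue scripts.

build_index_params() {
  local solr_host="$1"             # e.g. http://localhost:8983
  local core="$2"                  # e.g. loc_dev
  local validation_core="${3:-}"   # e.g. loc_validation, or empty
  local params="--solrUrl ${solr_host}/solr/${core}"
  # Assumption: the validation URL is only passed when a validation core is set
  if [ -n "${validation_core}" ]; then
    params="${params} --validationUrl ${solr_host}/solr/${validation_core}"
  fi
  printf '%s\n' "${params}"
}

build_index_params "http://localhost:8983" "loc_dev" "loc_validation"
```

Both URLs share the same SOLR_HOST here, which is exactly what prevents pointing the main index and the validation index at different Solr instances.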

As part of the validation process, the calculate-aggregated-numbers.grouped.sh script creates the validation index simply to improve performance. Reading from Solr is faster than reading from SQLite3, and if the dataset is large (as is the case with K10plus), the difference has an impact on the user interface. This happens only if the --groupBy parameter is set to 1.

In summary:

- the validation process creates the validation index, which contains the issues found per record
- the index process merges the validation index into the main index. After that the validation index is no longer needed, and one can search Solr for both bibliographic content and validation issues.
- this happens only under a specific condition: if the --groupBy parameter is set to 1.

In the web interface two new configuration options have been introduced: mainSolrEndpoint and solrEndpoint4ValidationResults. The default for both is http://localhost:8983/solr/.

Conclusion:

- the parameters could be explicit, with default values as a fallback if no explicit parameter is set
- the validation indexing should be triggered by an explicit flag
- in the long run the index should contain additional information from other analyses, e.g. completeness scores, so the name "validation" is misleading. We should call it differently, e.g. "scores", so the parameter should be something like solrForScoresUrl
- once this auxiliary index is merged, we can delete it
- the backend and frontend parameters should have the same names and semantics
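The first point, explicit parameters with defaults as fallback, could look roughly like this in a shell script. get_option is a hypothetical helper, and the default URLs are assumptions built from the localhost values mentioned in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: return the value following --NAME in the argument
# list, or DEFAULT if the option is not present.
# Usage: get_option NAME DEFAULT ARGS...

get_option() {
  local name="$1"
  local default="$2"
  shift 2
  while [ "$#" -gt 0 ]; do
    if [ "$1" = "--${name}" ] && [ "$#" -ge 2 ]; then
      printf '%s\n' "$2"
      return 0
    fi
    shift
  done
  printf '%s\n' "${default}"
}

NAME="loc"  # example catalogue name
# Explicit values win; otherwise fall back to the localhost defaults
get_option solrUrl "http://localhost:8983/solr/${NAME}" "$@"
get_option solrForScoresUrl "http://localhost:8983/solr/${NAME}_validation" "$@"
```

Keeping the fallback in one place makes it easier to give the backend and the frontend the same names and the same defaults.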

Based on this I reset the estimate to 10 hours.


pkiraly · 2023-10-28T15:10:25Z

@nichtich I moved the schema preparation into prepare-solr and the swap of NAME and NAME_dev into a postprocess-solr script, so we have 3 indexing related tasks:

- prepare-solr: creates the two indices if they do not already exist, and prepares the schema
- index: purely runs the indexing task
- postprocess-solr: swaps the indices

You have to run only the index command and not all-solr (which triggers all 3 tasks).
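The relation between the tasks can be sketched as below. tasks_for is a hypothetical function used only to illustrate that all-solr expands to the three tasks while index runs alone; it is not how the qa-catalogue scripts dispatch tasks.

```shell
#!/usr/bin/env bash
# Hypothetical illustration of the task relation described above.

tasks_for() {
  case "$1" in
    all-solr)
      # all-solr triggers all three indexing related tasks, in this order
      echo "prepare-solr index postprocess-solr"
      ;;
    prepare-solr|index|postprocess-solr)
      # a single task runs on its own
      echo "$1"
      ;;
    *)
      echo "unknown task: $1" >&2
      return 1
      ;;
  esac
}

tasks_for all-solr
tasks_for index
```

For a pre-configured collection with limited access, only the second form applies, since it touches neither the schema nor the collection names.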

To experiment with the settings I created a new flag, --onlyIndex. If it is set, the index task uses the main index; otherwise it uses the _dev index:

  ONLY_INDEX=$(echo ${TYPE_PARAMS} | grep -c -P -e '--onlyIndex' || true)
  if [[ "${ONLY_INDEX}" == "0" ]]; then
    CORE=${NAME}_dev
  else
    CORE=${NAME}
  fi

If you still want to use two Solr indices and are fine with the NAME_dev index, and only want to manage the schema and the swapping of the indices independently of the script, please let me know, because then this last piece is not needed at all.

Note: the feature is not yet done.

pkiraly · 2023-11-16T10:32:56Z

another TODO:

fix calculate-aggregated-numbers.grouped.sh


pkiraly · 2023-11-26T11:15:30Z

@nichtich Now it depends on the configuration, and hopefully there are no hidden hardwired variables in the code. To prevent index creation, use --onlyIndex. To specify the "validation" index, use --solrForScoresUrl [URL of Solr core] (otherwise it will use localhost:8983 as the host name and NAME_validation as the Solr core).


pkiraly · 2024-03-14T17:03:48Z

@nichtich Have you had a chance to check it? I provided somewhat more detailed documentation in the README. I guess this issue is fixed, but I would like to hear your opinion.

nichtich added the enhancement label Jun 28, 2023

pkiraly self-assigned this Jun 28, 2023

pkiraly added this to the PICA: 1.3 milestone Jun 28, 2023

pkiraly added this to PICA Jun 28, 2023

nichtich mentioned this issue Jun 28, 2023

Support configuration of Solr API endpoint pkiraly/qa-catalogue-web#109

Open

nichtich added the priority:high label Aug 16, 2023

pkiraly added a commit that referenced this issue Oct 20, 2023

Allow to run solr-indexing without an existing collection #289

5f7554b

pkiraly added a commit that referenced this issue Oct 24, 2023

Allow to run solr-indexing without an existing collection #289

178fd6a

pkiraly added a commit that referenced this issue Oct 28, 2023

Allow to run solr-indexing without an existing collection #289: addin…

a8df0ac

…g postprocess-solr

pkiraly added a commit that referenced this issue Oct 28, 2023

Allow to run solr-indexing without an existing collection #289: addin…

e4529cb

…g postprocess-solr

pkiraly added a commit that referenced this issue Nov 26, 2023

Allow to run solr-indexing without an existing collection #289: fixi…

94e7ff1

…ng the issue around 'validation' Solr instance

pkiraly added a commit that referenced this issue Nov 26, 2023

Allow to run solr-indexing without an existing collection #289: fixin…

f05aca5

…g the issue around 'validation' Solr instance

pkiraly added a commit that referenced this issue Nov 26, 2023

Allow to run solr-indexing without an existing collection #289: fixin…

d11c30f

…g the issue around 'validation' Solr instance

pkiraly moved this to 👀 In review in PICA May 27, 2024
