Add solr field with number of times a field is used in a record #342

Open
nichtich opened this issue Nov 1, 2023 · 5 comments

Comments

@nichtich
Collaborator

nichtich commented Nov 1, 2023

There seems to be no way to query the Solr index for fields alone (except for fields that don't have subfields). I would like to query:

  • records that include a given field
  • records that include a given field more than once
  • records that include a given field x times

This should be doable with an additional index field holding the number of times a field is used in the record.

@pkiraly
Owner

pkiraly commented Nov 1, 2023

For the name and subject authority PICA/MARC fields there are Solr fields that concatenate all non-administrative subfields. They are called *_full_ss. To query records that include a given field, you can use this Solr field (e.g. for Allgemeine Systematik für Bibliotheken: 045B_full_ss:*). But evidently this is not what you are looking for.

Staying with this example, we can create a Solr field whose name is the name of the PICA/MARC field plus a prefix/suffix (say count or instances), and whose value is an integer. The queries would look like this:

  • records that include a given field: 045B_count_i:*
  • records that include a given field more than once: 045B_count_i:[2 TO *]
  • records that include a given field x times: 045B_count_i:x
    (the _i suffix is a Solr dynamic field name pattern for integers)
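
As a rough illustration, the three queries above could be generated for any tag like this. It is only a sketch: the helper name field_count_queries is hypothetical and not part of the tool, and it assumes the <tag>_count_i naming proposed here.

```python
def field_count_queries(tag: str, times: int) -> dict[str, str]:
    """Build the three Solr query strings for one PICA/MARC field tag.

    Assumes the count is stored in a Solr field named <tag>_count_i.
    """
    solr_field = f"{tag}_count_i"
    return {
        # records that include the field at all
        "present": f"{solr_field}:*",
        # records that include the field more than once
        "more_than_once": f"{solr_field}:[2 TO *]",
        # records that include the field exactly `times` times
        "exactly": f"{solr_field}:{times}",
    }


queries = field_count_queries("045B", 3)
for label, query in queries.items():
    print(label, query)
```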

Is it OK for you? Would you like to see another prefix/suffix or no suffix?

@nichtich
Collaborator Author

nichtich commented Nov 1, 2023

I prefer short names such as 045B_i but 045B_count_i:* is ok as well.

@pkiraly
Owner

pkiraly commented Nov 1, 2023

Thanks! I'd add count or similar because from 045B_i my first association is that it contains a value of the field (e.g. a year, or page number) transformed into an integer.

@pkiraly pkiraly self-assigned this Nov 2, 2023
@pkiraly pkiraly added this to PICA Nov 2, 2023
@pkiraly pkiraly added this to the PICA: 1.3 milestone Nov 2, 2023
@pkiraly pkiraly moved this to 👀 In review in PICA Nov 2, 2023
@pkiraly
Owner

pkiraly commented Nov 2, 2023

It is ready for testing. You need to add the --indexFieldCounts flag, otherwise the index will not contain the counts.

It uses the field's id + _count_i as the Solr name. The id groups instances of the same field that have different occurrences, and distinguishes different fields that share the same tag. @ and - have been transformed to _. Please suggest alternatives if you dislike this approach. Here is the result (a Solr document):

{
"id": "010531483",
...,
"001__count_i":1,
"001A_count_i":1,
"001B_count_i":1,
"001U_count_i":1,
"002__count_i":1,
"002C_count_i":1,
"002D_count_i":1,
"002E_count_i":1,
"003__count_i":1,
"003O_count_i":1,
"003S_count_i":1,
"003T_count_i":1,
"004A_count_i":1,
"006G_count_i":1,
"006U_count_i":1,
"007G_count_i":1,
"009__count_i":1,
"010__count_i":1,
"011__count_i":1,
"017G_count_i":1,
"017L_count_i":11,
"019__count_i":1,
"021A_count_i":1,
"028A_count_i":1,
"032__count_i":1,
"033A_count_i":1,
"034D_count_i":1,
"034M_count_i":1,
"044K_00_09_count_i":2,
"045E_count_i":1,
"045R_count_i":1,
"046K_count_i":1,
"046X_count_i":1,
"047A_count_i":1,
...
}
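
The naming rule can be sketched in a few lines. This is not the tool's actual code: the function name count_field_name is hypothetical, the input id formats ("001@", "044K/00-09") are assumed, and treating "/" like "@" and "-" is a guess based on the 044K_00_09_count_i key in the document above.

```python
import re


def count_field_name(field_id: str) -> str:
    """Derive the Solr count field name from a field id.

    "@" and "-" become "_" as described in the comment above.
    Replacing "/" the same way is an assumption made for this sketch.
    """
    return re.sub(r"[@/-]", "_", field_id) + "_count_i"


for field_id in ["001@", "045E", "044K/00-09"]:
    print(field_id, "->", count_field_name(field_id))
```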

@pkiraly
Owner

pkiraly commented Dec 19, 2024

In the last commits I improved it a bit and added subfield counts. However, since there might be repeatable fields with non-repeatable subfields, summing the subfield counts across instances would give a false result from a validation perspective. The index therefore stores the subfield count as a list, with one value per field instance:

  • "bib036F7_count_is": [2] - there is one instance of 036F, but it has two $7 instances
  • "bib017La_count_is": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1] - there are several instances of 017L, but each has only a single $a instance
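
To make the per-instance list concrete, here is a minimal sketch of how such values could be computed. The record layout (a list of tag and subfield-code pairs) and the function name subfield_counts are hypothetical and only for illustration. It also assumes that a field instance lacking a subfield contributes nothing to the list, and it skips the @ and - replacement in field ids.

```python
from collections import Counter, defaultdict

# Hypothetical toy record: one (tag, subfield codes) pair per field instance.
record = [
    ("036F", ["7", "7", "x"]),
    ("017L", ["a"]),
    ("017L", ["a"]),
]


def subfield_counts(record, prefix="bib"):
    """Collect one subfield count per field instance instead of a sum."""
    result = defaultdict(list)
    for tag, codes in record:
        # Count the subfield codes inside this single field instance.
        for code, n in Counter(codes).items():
            result[f"{prefix}{tag}{code}_count_is"].append(n)
    return dict(result)


counts = subfield_counts(record)
print(counts)
```

Keeping the list means a validator can still see that no single 017L instance repeats $a, which a summed value would hide.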

New indexing related parameters:

  • --indexFieldCounts: a flag to denote indexing the count of field instances
  • --indexSubfieldCounts: a flag to denote indexing the count of subfield instances
  • --fieldPrefix <prefix>: a field prefix (see note below)

Notes:

  • *_is: a dynamic Solr field that contains multiple integer values
  • bib: here it is a prefix for Solr fields. If a Solr field name starts with a number, it cannot be retrieved in some queries, e.g. in fl (the list of fields to retrieve if you do not want to retrieve the whole record). Adding an alphabetic prefix (even a single character) solves this issue
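
For example, a request that filters on a field count and retrieves only selected count fields through fl could be assembled as below. This is a sketch using only the Python standard library. The helper select_params is hypothetical, and the field name bib017L_count_i assumes that the prefix is applied to field counts the same way as to subfield counts.

```python
from urllib.parse import parse_qs, urlencode


def select_params(query: str, fields: list[str]) -> str:
    """Encode q and fl as a URL query string for a Solr select request."""
    return urlencode({"q": query, "fl": ",".join(fields)})


query_string = select_params(
    "bib017L_count_i:[2 TO *]",      # assumed prefixed field count name
    ["id", "bib017La_count_is"],     # fields to return instead of the whole record
)
print(query_string)

# Decode again to check that the parameters survive URL encoding.
decoded = parse_qs(query_string)
print(decoded["fl"])
```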
