Add backend: ProQuest Federated Search Gateway #3991

Open
wants to merge 55 commits into
base: dev

maccabeelevine
Member

@maccabeelevine maccabeelevine commented Oct 7, 2024

The ProQuest Federated Search Gateway is an SRU API for searching across research databases licensed via ProQuest. It's old (docs last updated 2016) but still works and per ProQuest customer service:

The Federated Gateway is still in use ... There are no plans to discontinue it and we would be fine with them posting the VuFind code for integration for other customers.

On the plus side, it has a fairly rich CQL syntax for search. On the minus, there are no facets offered other than the constituent databases.

Implementation is patterned after the soon-to-be-deleted WorldCat backend as they both use SRU.
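For readers unfamiliar with SRU, here is a minimal sketch of what a searchRetrieve request URL looks like. This is not the PR's code: the endpoint URL is a placeholder, the function name is hypothetical, and the SRU version the gateway expects is an assumption. Only the standard SRU parameter names and the `cql.serverChoice` index (seen in the test URLs in this thread) are taken as given.

```php
<?php
// Placeholder endpoint; the real gateway URL is not given in this thread.
const FSG_BASE_URL = 'https://fsg.example.org/sru';

/**
 * Build an SRU searchRetrieve URL for a single index/term CQL query.
 * Hypothetical helper for illustration only.
 */
function buildSruUrl(
    string $term,
    string $index = 'cql.serverChoice',
    int $start = 1,
    int $limit = 20
): string {
    // Quote the term and backslash-escape embedded quotes and backslashes.
    $quoted = '"' . addcslashes($term, '"\\') . '"';
    $params = [
        'operation' => 'searchRetrieve',
        'version' => '1.2', // assumption: adjust to whatever the gateway supports
        'query' => $index . '=' . $quoted,
        'startRecord' => $start,
        'maximumRecords' => $limit,
    ];
    return FSG_BASE_URL . '?' . http_build_query($params, '', '&', PHP_QUERY_RFC3986);
}

echo buildSruUrl('kenyon review'), "\n";
```

Paging maps directly onto `startRecord` and `maximumRecords`, which is why an SRU backend can follow the same pattern as the WorldCat one.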

TODO

@maccabeelevine maccabeelevine marked this pull request as ready for review November 6, 2024 16:34
@maccabeelevine
Member Author

Other than the TODOs, I think this is far enough along to be worth a review. Also worth noting: there is no authentication config; access is simply IP-based.

Contributor

@sturkel89 sturkel89 left a comment

I've spent some time reviewing the test branch and discussed some of my findings with Demian. Here's a first round of comments on items that are probably quick fixes. Thanks, @maccabeelevine!

Checklist:

  • Change text for results page after open search
  • Rename Source facet to Database
  • Sort values in Source facet by number of results
  • Remove HTML code from some titles in results browse (getTitle vs. getShortTitle)
  • Publication information in record view could be fleshed out
  • Searches from clicking Publisher links in item records don't work
  • Fix Similar Items tab error
  • Staff View tab needs formatting

Change text for results page after open search

An open search just says "No results! Your search - - did not match any resources." Could we change that to custom text asking the user to enter a search term? Demian mentioned that the WorldCat2 backend displays a meaningful message in this situation.

This is the search: http://localhost/vufind_test/ProQuestFSG/Results?lookfor=&type=cql.serverChoice&limit=20&sort=date%2Fascending


Improvements needed to the Source facet

First, can we change the name of this facet to Database instead of Source?

Second, the Source facet is sorted in a mysterious order: it is neither alphabetical nor by number of hits per database. Can the Source (or Database) list be sorted by number of results, descending?
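A rough sketch of the requested ordering, assuming each facet value arrives as an array with `value` and `count` keys (that shape and the function name are assumptions, and the counts below are made up):

```php
<?php
/**
 * Sort facet values by hit count, highest first, breaking ties by label.
 * Hypothetical helper; the real facet structure in the backend may differ.
 *
 * @param array<int, array{value: string, count: int}> $facets
 */
function sortFacetsByCount(array $facets): array
{
    usort($facets, function (array $a, array $b): int {
        // Descending by count, then case-insensitive ascending by label.
        return ($b['count'] <=> $a['count']) ?: strcasecmp($a['value'], $b['value']);
    });
    return $facets;
}

// Illustrative data only.
$databases = [
    ['value' => 'Ebook Central', 'count' => 12],
    ['value' => 'Publicly Available Content Database', 'count' => 340],
    ['value' => 'Another Database', 'count' => 12],
];
foreach (sortFacetsByCount($databases) as $facet) {
    echo $facet['value'], ' (', $facet['count'], ")\n";
}
```

The tie-break on label keeps the order stable between page loads when several databases return the same number of hits.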


HTML code appearing in titles in results browse

For some records, HTML code is visible in the item title in results browse but not in record view. Demian thinks that this is due to the difference between getTitle and getShortTitle in the record driver. Can you fix it? [Additional info: it seems to me that the problematic articles come from the database/source Publicly Available Content Database.]

Example 1:
Results browse:
image

Item record:
image

See also this record.


Publication information display in item records could be improved

Currently, the "Published" field in item record view displays information from the 260 field only. This is usually the publisher and a brief version of the date.

This might be okay for books, but almost everything in PQ is a newspaper or journal article, so the full citation information including publication title, volume, issue, pages, and date should be displayed.

That information is stored in the 773 field. Can we display the 773 field instead, and maybe only show the 260 if the 773 is not present?

Example from the test branch:

Same item as displayed in EDS:

The "Source" information displayed in the EDS record is accurate and is what the patron will want to see. This information is present in the 773 in the ProQuest record on the test branch:

<datafield tag="773" ind1="0" ind2=" ">
  <subfield code="t">The Kenyon Review</subfield>
  <subfield code="g">vol. 14, no. 1 (Winter 1992), p. 26-27</subfield>
</datafield>

This is the 260 in the same record:
<datafield tag="260" ind1=" " ind2=" ">
  <subfield code="b">Kenyon College</subfield>
  <subfield code="c">Winter 1992</subfield>
</datafield>
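One way to read the "773 first, 260 as a fallback" suggestion is sketched below. It is not the PR's implementation: the function names are hypothetical, and it assumes the record XML has no default namespace (namespaced MARCXML would need a registered XPath prefix).

```php
<?php
/**
 * Return the first value of a MARC subfield, or null when it is absent.
 * Hypothetical helper for illustration.
 */
function marcSubfield(SimpleXMLElement $record, string $tag, string $code): ?string
{
    $hits = $record->xpath(
        sprintf('datafield[@tag="%s"]/subfield[@code="%s"]', $tag, $code)
    );
    return $hits ? trim((string)$hits[0]) : null;
}

/**
 * Prefer the 773 citation (title + volume/issue/pages); use 260 only if 773 is empty.
 */
function publicationLine(SimpleXMLElement $record): ?string
{
    $citation = array_filter([
        marcSubfield($record, '773', 't'),
        marcSubfield($record, '773', 'g'),
    ]);
    if ($citation) {
        return implode(', ', $citation);
    }
    $publisher = array_filter([
        marcSubfield($record, '260', 'b'),
        marcSubfield($record, '260', 'c'),
    ]);
    return $publisher ? implode(', ', $publisher) : null;
}

$xml = <<<XML
<record>
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="b">Kenyon College</subfield>
    <subfield code="c">Winter 1992</subfield>
  </datafield>
  <datafield tag="773" ind1="0" ind2=" ">
    <subfield code="t">The Kenyon Review</subfield>
    <subfield code="g">vol. 14, no. 1 (Winter 1992), p. 26-27</subfield>
  </datafield>
</record>
XML;

echo publicationLine(new SimpleXMLElement($xml)), "\n";
```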


Searches from clicking Publisher links in item records don't work

When viewing an item record, you can successfully click a hyperlinked author name or subject term to retrieve other records sharing that term.

When you click a Publisher hyperlink, you get "No results found." This is the URL that fails: http://localhost/vufind_test/ProQuestFSG/Results?type=Publisher&lookfor=Rabbinical%20Council%20of%20America

An advanced search for those terms produces results: http://localhost/vufind_test/ProQuestFSG/Results?join=AND&lookfor0%5B%5D=Rabbinical+Council+of+America&type0%5B%5D=cql.serverChoice&bool0%5B%5D=AND

Demian suggests that you need a custom record driver-specific Publisher link template to fix this. He sent this to help: https://github.com/vufind-org/vufind/blob/dev/themes/bootstrap3/templates/RecordDriver/WorldCat2/link-publisher.phtml
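As a rough illustration of what a driver-specific publisher link could generate, the sketch below builds a basic search URL that uses `cql.serverChoice` (which works in the advanced search above) instead of `type=Publisher`. The function name is hypothetical, and falling back to `cql.serverChoice` is an assumption rather than the confirmed fix; the gateway may offer a more specific publisher index.

```php
<?php
/**
 * Build a results URL that searches for a publisher name via cql.serverChoice.
 * Hypothetical helper; the route matches the test URLs quoted in this thread.
 */
function publisherSearchUrl(string $publisher, string $route = 'ProQuestFSG/Results'): string
{
    $query = http_build_query(
        ['lookfor' => $publisher, 'type' => 'cql.serverChoice'],
        '',
        '&'
    );
    return $route . '?' . $query;
}

echo publisherSearchUrl('Rabbinical Council of America'), "\n";
```

In a real template the route would come from VuFind's URL helpers rather than a hard-coded string.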


Similar Items tab error

The Similar Items tab in item record view doesn't work; it displays a red box error.

Demian says that you need to create a section for the record driver in RecordTabs.ini and disable 'similar items.'


Staff View tab is not formatted

The contents of the Staff View tab are displayed as a giant scary blob. Demian says that to fix it, you just need to change StaffViewArray to StaffViewMARC in RecordTabs.ini.

image

@@ -0,0 +1,14 @@
<?php foreach ($data as $field): ?>
Member Author

I've implemented @sturkel89 's feedback to add the MARC 773 info:

Publication information display in item records could be improved

I did not yet do what was suggested to display either the 773 info OR the 260 as a backup -- they are both being displayed. The publisher field ("Kenyon College" here) is unique to the 260.
The publication date (Winter 1992 here) does seem to be duplicated between the two, so in theory I could skip that on the Published line. But I'm not sure how opinionated to be about that.

image

Member

While it's a little ugly to repeat the information, I think the 260 and 773 fields are semantically different and won't necessarily always include redundant information. I'd be inclined to continue displaying both in the interest of completeness.

Member Author

I think I agree. Could we change the heading "Published" though to "Published by" to distinguish from "Published in"? I know the date part is not technically "by" but I think it could still work.

Member

Not a bad idea, just a question of how to implement it in the most translation-friendly way (i.e. do we revise existing keys or create a new key? Is the existing key used in multiple contexts that might have different semantics?)

Member Author

I think it would have to be a new key; I see 'Published' used for two different meanings in RecordDataFormatterFactory alone, referencing getPublicationDetails and getDateSpan.

All that said, I think this would be a separate PR as it affects other backends and others might want to weigh in.

Member

I definitely agree, it's out of scope for this PR but worth looking at separately.

Contributor

This all makes sense and I appreciate the need to consider language strings/translation. I am going to post follow-up comments with other suggestions for tweaking results browse and record view after I do some more analysis of PQ content and fields.

@@ -62,6 +62,7 @@
}

// Comma-separate formatting
.record .format { display: inline-flex; }
Member Author

This eliminates the whitespace before the comma, which comes from the HTML simply having line breaks between the spans. I didn't see any negative side effects, but I couldn't find another backend that uses this formatting to test against, so I'm probably missing something.

Member

Generally speaking, we go through contortions to avoid having line breaks before commas in our templates because of this problem. If using inline-flex is a viable solution, then we might be able to make some of our templates less ugly by wrapping lines, etc. :-)

@maccabeelevine
Member Author

The API offers two "expanders". I've decided (for now) not to implement either.

  • x-synonyms sounds like it would search for any synonyms of words, but per the docs it only searches for US vs. UK spelling equivalents. In practice, I can't get it to work on any words I would consider synonyms. So describing it in the UI as "synonyms" would be misleading, and I'm not sure I trust the docs in the first place.
  • x-lemmatize is useful (it searches on alternate forms of a word, e.g. tall/taller/tallest), but it's on by default and I can't see a good use case for turning it off. I also can't see how to explain it in a brief checkbox label (without a tooltip).

@maccabeelevine
Member Author

There seems to be no highlighting support in the API, so dropping that TODO possibility.

@sturkel89
Contributor

Decode HTML entities

You made a change above to strip HTML tags from titles in results browse. That eliminated HTML tags within angle brackets as in the example I gave above.

However, other HTML entities are still visible in results browse and in item records, including &amp;, &lt; and &gt;.

Examples (first record, second record):

and

@demiankatz suggests that wherever you use "strip tags," you should also decode HTML entities. Thanks!
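A minimal sketch of the "strip tags, then decode entities" step, using PHP's built-in `strip_tags()` and `html_entity_decode()`. The function name and sample string are hypothetical, and the strip-then-decode order is an assumption: it keeps text such as `&lt;1992&gt;` from being removed as if it were a tag.

```php
<?php
/**
 * Remove HTML tags from a raw field value, then decode HTML entities.
 * Hypothetical helper for illustration.
 */
function cleanDisplayText(string $raw): string
{
    // Strip real markup first so that encoded angle brackets survive as text.
    $withoutTags = strip_tags($raw);
    return html_entity_decode($withoutTags, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

echo cleanDisplayText('<i>Kenyon</i> Review &amp; Friends &lt;1992&gt;'), "\n";
```

The result is plain text, so it still needs to be HTML-escaped again when it is written into a template.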

@sturkel89
Contributor

Source lightbox?

I'd like to be able to bring up a lightbox so I could sort and filter the list of database sources, as we can sort and filter some facet groups in "regular" VuFind. This would make it easier to include or exclude the groups of historical newspaper sources, for example, and would be extra-great when we have the ability to apply multi-filter selection and deselection.

Is it possible? (NB: This functionality doesn't seem to be available for any facet group in EDS, so maybe it's not possible when there are huge lists of facet values.)

@sturkel89
Contributor

sturkel89 commented Dec 13, 2024

Record display suggestions

Display Database in record
Can we add the Database/Source to the record view, toward the bottom of the main section (as in EDS at Lehigh and at Villanova)?


Display DOI in record
If it's available, can we display the DOI in either the main part of the record or the Description tab? We have it in the Description field in EDS at Lehigh, and in both the main part of the record and in the Description tab in EDS at Villanova.


Hide date in Published field in record
I've reviewed the display of publication information in a lot of records for a variety of publication types.

In every case, the 260 field (usually publisher name and publication date) provides identical or less specific date information than the date information that's present in the 773. (260 populates the Published area, and 773 populates the Published In area.)

I suggest we try displaying ONLY 260, subfield b for the "Published" field. We'll still get the completeness from including the publisher name as well as the journal/publication date, and eliminate the redundancy and clutter of repeating the date.

@sturkel89
Contributor

Results browse suggestions

Display full Published in data on results page
I'd like to see FULL "Published In" data from the 773 displayed in results browse, rather than only the publication name and date. If you implement this, it would add volume, issue, and page number information where available.


Hide second copy of ebook titles on results page
When the item is an ebook, the 245|a (title) field tends to be identical to the 773|t (published in|title) field. This looks weird in results browse:

Can we compare those two fields and, if they're identical, either suppress the 773 value from appearing in results browse or replace the 773 with the 786 t (database source)?
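The comparison could look something like the sketch below. The function names are hypothetical, and the normalization (case, whitespace, trailing punctuation) is an assumption about how the two fields tend to differ.

```php
<?php
/**
 * Normalize a title for comparison: lowercase, trim trailing MARC-style
 * punctuation, and collapse runs of whitespace.
 */
function normalizeForComparison(string $value): string
{
    $value = mb_strtolower(trim($value), 'UTF-8');
    $value = rtrim($value, " \t/:;,.");
    return preg_replace('/\s+/u', ' ', $value);
}

/**
 * Return the container title (773 $t) to display, or null when it is missing
 * or merely repeats the item title (245 $a).
 */
function containerTitleToDisplay(string $title, ?string $containerTitle): ?string
{
    if ($containerTitle === null || trim($containerTitle) === '') {
        return null;
    }
    if (normalizeForComparison($title) === normalizeForComparison($containerTitle)) {
        return null;
    }
    return $containerTitle;
}

var_dump(containerTitleToDisplay('A History of Reading.', 'A history of reading'));
var_dump(containerTitleToDisplay('Some Article', 'The Kenyon Review'));
```

A caller that receives null could then fall back to the database source value instead of leaving the line empty.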


Abbreviate date everywhere for ebook records
It looks weird to me to have the full publication date (month, day, year) appear in ebook records; it makes me think that the item is a newspaper article that is tied to a very specific publication date. Year would be sufficient! I think the pub year is present in the first four numeric characters in the 045 field. Would it be possible to make that replacement if the source is Ebook Central, or based on some other qualifier?
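If the year really is the first four digits of the 045 value, extracting it is straightforward. The sketch below is hypothetical: the function name and sample values are made up, and it takes the first run of four consecutive digits, which is an assumption about the field's layout.

```php
<?php
/**
 * Return the first run of four consecutive digits in a field value, or null.
 * Hypothetical helper; the sample inputs below are illustrative only.
 */
function extractYear(?string $value): ?string
{
    if ($value !== null && preg_match('/\d{4}/', $value, $matches)) {
        return $matches[0];
    }
    return null;
}

var_dump(extractYear('d19920315'));
var_dump(extractYear('n.d.'));
```

Whether to apply this only to Ebook Central records, or by some other qualifier, would be a separate check on the database source.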

3 participants