Add basic mechanisms for using/caching external entities #1009

Draft
wants to merge 47 commits into
base: develop

@olovy (Contributor) commented Oct 14, 2021

  • Cache external entities in lddb as entities mapped to the local vocabulary (KBV in our case).

  • Store them with the original (external) id as the thing id (similar to id.kb.se things) and with the new record type CacheRecord.

  • Cache records are created automatically when any document is saved with a new link to something that can be understood/mapped by the system.

  • Only store cache records for things that are used (linked) directly from our data, so that they can be searched/linked like any other record. Links to other external things inside cache records are not stored in the system but converted on the fly in embellish (we don't want to pull in the whole of Wikidata recursively by mistake).

  • Create a PlaceholderRecord for every link in the system that points to something that is not in the system. This keeps track of links that point at something outside lddb that we don't have a cached entity for, e.g. things we can't map, things that don't exist, and links from cache records. That way everything is kept up to date when data is added in a different order (lddb__dependencies etc.).

    • e.g. cache records: first Skellefteå, Umeå and Lycksele are used, and later Västerbotten (which the former are isPartOf).
    • e.g. links to things that we don't understand now but later add mappings for, such as when we first link LCSH terms and then add a mapper for them.
  • Mapping is a black box (URI -> KBV entity) that for now contains hand-crafted code; later it could be TVM or a mix of approaches.

  • Add basic mapping for Wikidata places and persons
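
The save-time behaviour described in the list above could be sketched roughly as follows. This is a toy in-memory illustration in Python, not the code in this PR: the `Store` class, the `example_mapper` function, the `links` key and the `example.org` IRIs are all hypothetical, and only the record type names CacheRecord and PlaceholderRecord come from the description.

```python
from typing import Any, Callable, Dict, Iterable, Optional

CACHE_RECORD = "CacheRecord"
PLACEHOLDER_RECORD = "PlaceholderRecord"

Entity = Dict[str, Any]
Mapper = Callable[[str], Optional[Entity]]  # the "black box": IRI -> mapped entity

# Hypothetical external data, standing in for e.g. Wikidata places.
UMEA = "https://example.org/ext/umea"
VASTERBOTTEN = "https://example.org/ext/vasterbotten"
UNKNOWN = "https://example.org/ext/unknown"

EXTERNAL: Dict[str, Entity] = {
    UMEA: {"@type": "Place", "label": "Umeå", "links": [VASTERBOTTEN]},
    VASTERBOTTEN: {"@type": "Place", "label": "Västerbotten", "links": []},
}


def example_mapper(iri: str) -> Optional[Entity]:
    """Return a mapped entity, or None when the IRI cannot be mapped."""
    return EXTERNAL.get(iri)


class Store:
    """Toy stand-in for lddb, keyed by thing id (IRI)."""

    def __init__(self, mapper: Mapper) -> None:
        self.mapper = mapper
        self.records: Dict[str, Dict[str, Any]] = {}

    def save(self, doc_id: str, links: Iterable[str]) -> None:
        """Save a local document and make sure each link target has a record."""
        targets = list(links)
        self.records[doc_id] = {"recordType": "Record", "links": targets}
        for iri in targets:
            self._ensure_target(iri)

    def _ensure_target(self, iri: str) -> None:
        existing = self.records.get(iri)
        if existing is not None and existing["recordType"] != PLACEHOLDER_RECORD:
            return  # already a local or cached record
        entity = self.mapper(iri)
        if entity is None:
            # Unmappable or nonexistent target: only keep track of the link.
            self.records.setdefault(iri, {"recordType": PLACEHOLDER_RECORD})
            return
        # Directly linked and mappable: cache it (this also upgrades a placeholder).
        self.records[iri] = {"recordType": CACHE_RECORD, "thing": entity}
        # Links inside a cache record are not fetched recursively;
        # they only get placeholders.
        for nested in entity.get("links", []):
            self.records.setdefault(nested, {"recordType": PLACEHOLDER_RECORD})
```

In this sketch, saving a document that links to Umeå creates a cache record for Umeå and only a placeholder for Västerbotten. A later document that links to Västerbotten directly turns that placeholder into a cache record, which mirrors the ordering example above.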

TODO

  • TBD: Create placeholders for all existing links (script)? Probably yes. Otherwise they will be created on demand, which works for the CacheRecord use case.
  • TBD: Cache records and placeholders could be deleted automatically when no longer used. Not done at the moment.
  • Cache/proxy all external requests at the HTTP level, to be shared by all whelk instances.
  • Error handling for external requests (I/O etc.)
  • Add parallelized ExternalEntities get(Collection<String> iris)
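
As a rough illustration of the last two TODO items, a parallel lookup with basic I/O error handling might look like the sketch below. It is written in Python with the standard library only and is not the planned `ExternalEntities` API: the `get_all` function, its `fetch_one` parameter and the choice to return `None` for failed lookups are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Optional

Entity = Dict[str, object]


def get_all(
    iris: Iterable[str],
    fetch_one: Callable[[str], Optional[Entity]],
    max_workers: int = 8,
) -> Dict[str, Optional[Entity]]:
    """Fetch several IRIs concurrently; a failed lookup yields None for that IRI."""
    unique = list(dict.fromkeys(iris))  # drop duplicates, keep first-seen order
    if not unique:
        return {}

    def safe_fetch(iri: str) -> Optional[Entity]:
        try:
            return fetch_one(iri)
        except OSError:
            # Covers socket errors and timeouts, and urllib.error.URLError,
            # which are all OSError subclasses.
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input.
        results = list(pool.map(safe_fetch, unique))
    return dict(zip(unique, results))
```

A caller would pass in whatever function performs the single external request. Swallowing the error into `None` keeps one unreachable source from failing the whole batch, at the cost of hiding the cause, so a real implementation would probably also log it.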

Depends on libris/definitions/pull/324

@olovy olovy requested review from jannistsiroyannis and removed request for jannistsiroyannis October 14, 2021 12:06
@olovy olovy marked this pull request as ready for review October 14, 2021 13:21
@olovy (Contributor, Author) commented Oct 14, 2021

At the moment we have 30 243 811 links to 192 330 entities that would become placeholders.

including
17 155 217 to 180 different https://libris.kb.se/sys/globalchanges/...
2 272 071 to bad library URIs
699 729 to 87 090 different http://id.loc.gov/authorities/...

http://xlbuild.libris.kb.se/404-STATISTICS.txt (14MB)

@olovy olovy force-pushed the feature/lxl-2483-external-data branch 5 times, most recently from 5135d6a to f898f7f Compare October 29, 2021 16:24
@olovy olovy marked this pull request as draft November 18, 2021 14:25
@olovy olovy force-pushed the feature/lxl-2483-external-data branch from 86bacf8 to cd16448 Compare November 25, 2021 10:39
@olovy olovy force-pushed the feature/lxl-2483-external-data branch from 2ff3472 to d4de301 Compare January 20, 2022 19:19
@olovy (Contributor, Author) commented Jan 20, 2022

Rebase on develop

@olovy olovy force-pushed the feature/lxl-2483-external-data branch from d4de301 to 301bb68 Compare April 11, 2023 12:28
@olovy (Contributor, Author) commented Apr 11, 2023

Rebase on develop
