BibliographicMetadata

leonardr edited this page Oct 28, 2014 · 40 revisions

Types of metadata

Information about the book ("objective")

This category includes most of the common things you might want to know about a book:

  • Title
  • Description
  • Authors
  • Identifiers, such as ISBN
  • Which work it's an edition of
  • Which series the book is in, and where it fits in that series.
  • Format
  • Publisher
  • Language
  • Classifications
  • Excerpts
  • Cover art

It also includes information about the content of the book:

  • Table of contents
  • Bibliography
  • Characters in a work of fiction
  • Real-life events and people mentioned in the book
  • Other books mentioned in the book's bibliography

"Objective" is in quotes because some things like classification and description can get a little subjective.

Information about readers' interactions with the book ("subjective")

  • Ratings
  • The text of reviews (reader reviews and professional reviews)
  • Popularity rankings (current and historical)
  • Awards

"Subjective" is in quotes because some things, like the popularity of a book, can be measured objectively.

Facts about our inventory

It's essential to get this information from paid ebook providers. Ideally we would have all this information, but some of it isn't currently available:

  • How many licenses we own.
  • How many loans are left on each of those licenses.
  • How many licenses are checked out, and when each checkout happened.
  • How many licenses are reserved for one particular patron, and when each reservation happened.
  • How many people are in the hold queue, and when they joined the hold queue.
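
One way to hold these facts in memory is sketched below. The `Inventory` class and its field names are hypothetical, not taken from any provider's API, and the availability rule (a license is unavailable while checked out or reserved) is an assumption. Loans remaining per license are left out for brevity.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class Inventory:
    """Inventory facts for one title (hypothetical structure)."""

    licenses_owned: int
    # One timestamp per active checkout, reservation, or hold.
    checkouts: List[datetime] = field(default_factory=list)
    reservations: List[datetime] = field(default_factory=list)
    hold_queue: List[datetime] = field(default_factory=list)

    @property
    def licenses_available(self) -> int:
        # Assumption: a license is unavailable while checked out or reserved.
        in_use = len(self.checkouts) + len(self.reservations)
        return max(0, self.licenses_owned - in_use)


inventory = Inventory(
    licenses_owned=5,
    checkouts=[datetime(2014, 10, 1), datetime(2014, 10, 20)],
    reservations=[datetime(2014, 10, 27)],
)
print(inventory.licenses_available)
```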

Metadata sources

Rather than the standardized vocabularies I'll cover later, most of these sources serve XML or JSON documents that use custom vocabularies. To figure out exactly what the XML tag names and JSON field names mean, you must consult the documentation that accompanies each source.

NYPL MARC

  • Includes topics, genre, summary, and more.
  • A MARC XML record (compare the same data in HTML)
  • Matt is converting our MARC catalog to JSON. We'll see what data we have available in there once he's done.

OCLC Classify

  • A free API that can turn title/author or ISBN into internal "work ids" or OCLC IDs. If we get internal "work ids", we can always turn them into a list of OCLC IDs. TODO: nail down terms of use. Noncommercial?

  • Author information is linked to VIAF (e.g.), which makes it possible to localize author names.

Worldcat Open Data

A significant amount of bibliographic data is available in RDF format once you know the OCLC ID. The data is made available under an attribution-style license.

  • Example RDF
  • Author information is linked to VIAF (e.g.), which makes it possible to localize author names.

OCLC xID

Maps between OCLC number, ISBN, OCLC work ID (owi), and LCCN.

This API is extremely prone to 500 errors. I don't think there's any reason to use it given the existence of WorldCat Open Data.

Syndetics

This is a pay service but we are already paying for it. I don't have login/password credentials, so I can't get documentation or support, but I do know our API credential.

We can look up significant amounts of metadata (review, summary, excerpt) given ISBN or OCLC number.

3M XML

A vocabulary for describing 3M Cloud Library's catalog. It looks to have been autogenerated from the names used in C# classes on 3M's side.

  • Includes minimal objective metadata
  • Includes no subjective metadata
  • Includes some inventory metadata

OverDrive JSON

OverDrive's Metadata API serves JSON documents. Its other APIs also use this vocabulary when describing books. Documents are served as a custom media type, application/vnd.overdrive.api+json, but I don't think the media type is formally defined anywhere.

  • Includes decent objective metadata
  • Supposedly includes subjective metadata, but it's pretty sparse
  • The availability API includes some inventory metadata.

Gutenberg Bookshelves

Project Gutenberg has a number of manually-created wiki bookshelves which divide books up by category. See for instance the Children's Bookshelf.

BiblioCommons

BiblioCommons's Titles API serves JSON documents that use custom terms like "authors", "isbns", and "title". Each term is explicitly defined in the API documentation.

Documents are served as application/json. The internal format of the JSON documents is not explicitly defined.

  • Includes minimal objective metadata
  • Includes no subjective metadata

Working with the BiblioCommons API

Anything that can be constructed as an Advanced Search can be structured as a parameterized title search.

As discussed in this [forum thread](http://developer.bibliocommons.com/forum/read/155663):

Queries from advanced search can be passed using search_type=custom. For example:

https://api.bibliocommons.com/v1/titles.json?library=gvpl&search_type=custom&q=anywhere%3A(moondog)%20%20formatcode%3A(BK%20OR%20EBOOK)&api_key={key}
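
A minimal sketch of building such a URL in Python, so the query does not have to be percent-encoded by hand. The helper names `advanced_query` and `custom_search_url` are made up for illustration. The parameter names come from the example URL above; the query syntax is inferred from that single example and is not a full grammar.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://api.bibliocommons.com/v1/titles.json"


def advanced_query(anywhere, format_codes):
    """Build a query string shaped like the forum example (hypothetical helper)."""
    formats = " OR ".join(format_codes)
    return f"anywhere:({anywhere}) formatcode:({formats})"


def custom_search_url(library, query, api_key):
    """Return a titles.json URL for a search_type=custom query."""
    params = {
        "library": library,
        "search_type": "custom",
        "q": query,
        "api_key": api_key,
    }
    # Keep parentheses literal and encode spaces as %20, as in the example URL.
    return BASE_URL + "?" + urlencode(params, safe="()", quote_via=quote)


url = custom_search_url("gvpl", advanced_query("moondog", ["BK", "EBOOK"]), "{key}")
print(url)
```

The result differs from the forum example only in using a single space between the two clauses; `{key}` stays a placeholder for a real API key and is itself percent-encoded here.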

Harvard LibraryCloud

JSON or MARC format. Some data comes from OCLC but some (e.g. "ShelfRank", a rating of community engagement a.k.a. popularity) is original.

GoodReads

XML format.

  • Includes good objective metadata
  • Includes good subjective metadata, including reviews, but "harvesting and indexing" data is forbidden without permission.

Booki.sh

Owned by Zola. We already buy access to the Bookish recommendation API. They also have a basic metadata API. I'm not sure if we get access to the metadata API under the same terms as the recommendation API.

Coverage is about 1.5 million books. Metadata comes from publishers and distributors. Primary key is ISBN13.

Zola is working on an API that serves (truncated) reviews of books and estimates star ratings. They are also working on a search API to map (e.g.) title/author to ISBN.
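
Because this source keys on ISBN-13, identifiers from elsewhere may need normalizing before lookup. The sketch below converts an ISBN-10 to ISBN-13 using the standard "978" prefix and check-digit rule. That other sources hand back ISBN-10s is an assumption, and the function names are illustrative.

```python
def isbn13_check_digit(first12: str) -> str:
    """Compute the ISBN-13 check digit for the first twelve digits."""
    # Weights alternate 1, 3, 1, 3, ... from the left.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(first12))
    return str((10 - total % 10) % 10)


def to_isbn13(isbn: str) -> str:
    """Normalize an ISBN-10 or ISBN-13 string to a bare 13-digit ISBN."""
    digits = isbn.replace("-", "").replace(" ", "").upper()
    if len(digits) == 13:
        return digits
    if len(digits) == 10:
        # Drop the ISBN-10 check character (it may be "X") and recompute.
        body = "978" + digits[:9]
        return body + isbn13_check_digit(body)
    raise ValueError(f"not an ISBN-10 or ISBN-13: {isbn!r}")


print(to_isbn13("0-306-40615-2"))
```

The sketch does not verify the incoming check digit; it only reshapes the identifier.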

iDreamBooks

"Metacritic for books". They have a JSON-based API. We can get a list of recent critically acclaimed books and books that were recently featured on TV. For a given book we can get up to 5 critical reviews.

API TOS are oriented around displaying reviews directly to end-users.

  • You agree to not use the API to redistribute, harvest or index iDreamBooks' data without our explicit consent.
  • You agree to not truncate, modify or change our data.
  • You may not store our data except for caching purposes.

WikiData

  • [Example: "The Art of War"](http://www.wikidata.org/wiki/Q8251). A good source for data about books (mostly public-domain books). We always get author information and a link to the Wikipedia page. If we're lucky we might get related images, author images, or a summary. (The first paragraph of the Wikipedia page makes a decent summary in many cases.)

Amazon Product Advertising API

XML format.

  • Includes good objective metadata.
  • Includes very good subjective metadata.

TOS forbids usage except to promote Amazon's products. I believe we have more favorable usage terms if we scrape the website as a spider rather than going through the API.

The SNAP dataset has Amazon user reviews up to March 2013. We might be able to use it--I don't know what the terms are.

NYT

USA Today

  • Book review API
  • Best-selling books API (apparently only for "personal, noncommercial use")
  • TOS "Your use of the USAT Census API, the USAT Books/Music/Movie Reviews API, and the USAT Articles API is not limited to your personal, noncommercial use and you may use those specific USAT APIs for commercial purposes."

LibraryThing

Probably the most useful [set of bibliographic APIs](https://www.librarything.com/wiki/index.php/LibraryThing_APIs).

  • LibraryThing "What Work" looks up the LibraryThing "work" for an ISBN or title/author.

  • ThingTitle is similar but gives a little more information.

  • Once we have the work ID, we can tie this into [ck.getwork](http://www.librarything.com/services/rest/documentation/1.1/librarything.ck.getwork.php) to get access to Common Knowledge facts about the book.

  • Bunch of data feeds I haven't looked at.

  • Very good objective metadata

  • Very good subjective metadata, but reviews are behind a TOS-wall.

Content Cafe

Provides a variety of metadata. It's a paid service, but we're already paying for it.

Cover Art Sources

Vocabularies

All of these metadata sources use different made-up vocabularies to talk about the same real-world objects (books).

Let's define a vocabulary as an agreed-upon set of real-world semantics for otherwise meaningless strings. On its own, the word "date" is ambiguous. It could refer to an object, an event, or a point in time. To a computer, "date" is just a four-character string. It doesn't mean anything at all. But in the Dublin Core vocabulary, "date" has a precise meaning: it always refers to a point in time.

Two systems can only communicate with each other if they share a vocabulary. Most vocabularies are ad hoc custom vocabularies designed for one specific system. For instance, the 3M API uses terms like "EventStartDateInUTC" whose meanings are defined only in the documentation for the 3M API. (And sometimes not even then--terms like "PhysicalISBN" are never formally defined.)

There are many, many vocabularies for describing books. It's probably the most common type of vocabulary, after vocabularies for describing people. For books, our problem is an abundance of standardized vocabularies on top of the abundance of ad hoc vocabularies.

Some vocabularies are tied to a format (ONIX, Atom). Some are tied not only to a format but to a specific piece of software (OverDrive, 3M). Some are format-neutral (Dublin Core, http://schema.org/Book). But each vocabulary has an ideology: it encapsulates the language used in a particular relationship.

For example, http://schema.org/Book encapsulates the language a webmaster uses when talking about a book to a search engine. ONIX encapsulates the language a publisher uses when talking about a book to a bookstore. They are both talking about books, but at different levels of detail and for different purposes.

When deciding which vocabulary or vocabularies to support we need to start with the question: who are we communicating with? What is our relationship with them?

Vocabularies that are tied to format

MARC

MARC is the LoC's famous heavyweight vocabulary for bibliographic information. There is a binary serialization and an XML serialization. Both use concise terms like "100" and "x" rather than human-readable terms like "author" and "subcategory"; these form MARC's "content designation".
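
To make the content designation concrete, here is a small sketch that pulls field 100 subfield "a" (main author) and field 245 subfield "a" (title) out of a MARC XML record. The sample record is hand-written for illustration, and the `subfield` helper is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by the MARC XML ("MARC21 slim") serialization.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Hand-written sample record, not real catalog data.
SAMPLE_RECORD = """\
<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Austen, Jane</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Pride and prejudice</subfield>
  </datafield>
</record>
"""


def subfield(record, tag, code):
    """Return the text of the first matching subfield, or None."""
    path = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
    return record.findtext(path, namespaces=MARC_NS)


record = ET.fromstring(SAMPLE_RECORD)
print(subfield(record, "100", "a"))  # main author
print(subfield(record, "245", "a"))  # title
```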

Project Gutenberg serves MARC records for its books, generated from its RDF catalog.

The 3M API will serve an XML MARC record for purchased ebooks ("Get MARC"). I don't think OverDrive offers this feature.

ONIX

ONIX is "intended to support computer-to-computer communication between parties involved in creating, distributing, licensing or otherwise making available intellectual property in published form, whether physical or digital." It is a heavyweight format like MARC, but MARC is optimized for library catalogs, while ONIX seems more general and is intended for business-to-business use.

ONIX is an XML vocabulary. It seems to define both MARC-like abbreviated terms like "b203" and equivalent human-readable terms like "TitleText".

Atom

Atom is a media type based on XML ("application/atom+xml") originally designed for serializing blog posts. It defines basic, generic terms like "author", "category", "published", and "title".

Although spartan on its own, Atom is extensible through a "profile" mechanism and has been widely extended, notably by OPDS, which defines an Atom profile for catalogs of ebooks.
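
As a rough illustration of how the generic Atom terms and the OPDS additions sit together, the sketch below parses one catalog entry. The sample entry is hand-written, the URLs in it are placeholders, and `parse_entry` is a hypothetical helper; real catalogs vary by vendor.

```python
import xml.etree.ElementTree as ET

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "dc": "http://purl.org/dc/terms/",
}
ACQUISITION_PREFIX = "http://opds-spec.org/acquisition"

# Hand-written sample entry, not taken from a real catalog.
SAMPLE_ENTRY = """\
<entry xmlns="http://www.w3.org/2005/Atom"
       xmlns:dc="http://purl.org/dc/terms/">
  <id>urn:uuid:00000000-0000-0000-0000-000000000001</id>
  <title>Pride and Prejudice</title>
  <author><name>Jane Austen</name></author>
  <dc:language>en</dc:language>
  <link rel="alternate" href="/books/1" type="text/html"/>
  <link rel="http://opds-spec.org/acquisition/open-access"
        href="/books/1.epub" type="application/epub+zip"/>
</entry>
"""


def parse_entry(xml_text):
    """Extract a few common fields from a single Atom/OPDS entry."""
    entry = ET.fromstring(xml_text)
    return {
        "id": entry.findtext("atom:id", namespaces=NS),
        "title": entry.findtext("atom:title", namespaces=NS),
        "authors": [n.text for n in entry.findall("atom:author/atom:name", NS)],
        "language": entry.findtext("dc:language", namespaces=NS),
        # Acquisition links are ordinary Atom links with an OPDS rel value.
        "acquisition": [
            (link.get("rel"), link.get("href"))
            for link in entry.findall("atom:link", NS)
            if link.get("rel", "").startswith(ACQUISITION_PREFIX)
        ],
    }


print(parse_entry(SAMPLE_ENTRY))
```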

Format-independent vocabularies

Most of these are RDF vocabularies, which can be serialized in a number of forms, used in Linked Data applications, and included in a number of other formats (because pretty much everything in an RDF vocabulary is a URI).

Schema.org vocabularies are RDF vocabularies that can also be expressed in HTML 5 microdata. The schema.org "Book" vocabulary defines basic terms like "author", "illustrator", "isbn", "copyrightYear", and "genre".
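
For a sense of what the microdata form looks like, here is a minimal sketch that renders a few "Book" terms as HTML. The `book_microdata` helper and the sample values are made up; "name" is the generic schema.org property inherited by "Book", and the rest are terms named above.

```python
from html import escape


def book_microdata(title, author, isbn, genre):
    """Render a schema.org Book as an HTML microdata snippet (illustrative)."""
    return (
        '<div itemscope itemtype="http://schema.org/Book">\n'
        f'  <span itemprop="name">{escape(title)}</span>\n'
        f'  <span itemprop="author">{escape(author)}</span>\n'
        f'  <span itemprop="isbn">{escape(isbn)}</span>\n'
        f'  <span itemprop="genre">{escape(genre)}</span>\n'
        "</div>"
    )


# Placeholder values, not a real book.
snippet = book_microdata("Example Title", "Example Author", "9780000000002", "Fiction")
print(snippet)
```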

Dublin Core

A basic "vocabulary of fifteen properties for use in resource description", first created in 1995. These include "subject", "creator", and "title". Since 1995 many additional terms have been added to Dublin Core, like "isReplacedBy" and "abstract".

Project Gutenberg uses Dublin Core's recommendation for presenting bibliographic information in RDF. This means a lot of DCMI Metadata Terms and limited use of the DCMI Abstract Model's "memberOf" term.

Miscellaneous

In addition to Dublin Core, Project Gutenberg's RDF documents also include one element from Creative Commons (for licensing), plus a custom Project Gutenberg vocabulary containing miscellaneous information about authors, ebook numbers, and download locations.
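
A sketch of reading the Dublin Core and Project Gutenberg terms out of such a document follows. The sample is hand-written and heavily simplified; the namespace URIs and element nesting are my best understanding of the catalog format and should be checked against a real record.

```python
import xml.etree.ElementTree as ET

# Namespace URIs are assumptions to verify against a real Gutenberg RDF file.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
    "pgterms": "http://www.gutenberg.org/2009/pgterms/",
}
RDF_ABOUT = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}about"

# Hand-written, simplified sample; real records carry many more elements.
SAMPLE_RDF = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/"
         xmlns:pgterms="http://www.gutenberg.org/2009/pgterms/">
  <pgterms:ebook rdf:about="ebooks/1342">
    <dcterms:title>Pride and Prejudice</dcterms:title>
    <dcterms:creator>
      <pgterms:agent rdf:about="2009/agents/68">
        <pgterms:name>Austen, Jane</pgterms:name>
      </pgterms:agent>
    </dcterms:creator>
  </pgterms:ebook>
</rdf:RDF>
"""


def parse_gutenberg(xml_text):
    """Pull the ebook identifier, title, and author names from the RDF."""
    root = ET.fromstring(xml_text)
    ebook = root.find("pgterms:ebook", NS)
    agents = ebook.findall("dcterms:creator/pgterms:agent", NS)
    return {
        "id": ebook.get(RDF_ABOUT),
        "title": ebook.findtext("dcterms:title", namespaces=NS),
        "authors": [a.findtext("pgterms:name", namespaces=NS) for a in agents],
    }


print(parse_gutenberg(SAMPLE_RDF))
```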

Side-by-side comparison

This table compares five major vocabularies for ebook sources, plus one source for catalog data (BiblioCommons). It focuses on the terms we care most about for identifying titles, tracking our inventory, and downloading books once they've been checked out. We can add to this table as we investigate more APIs.

| Field | Atom+OPDS | Overdrive | 3M | Axis 360 | Gutenberg | BiblioCommons |
| --- | --- | --- | --- | --- | --- | --- |
| Internal ID | atom:id or dc:identifier | id | ItemId | titleId | The URL that pgterms:ebook is rdf:about | id |
| Permalink | atom:id | links[self], links[availability], links[metadata] | BookLinkURL | titleUrl | Same as internal ID | details_url |
| ISBN-13 | dc:identifier may be a urn:isbn: URI, but probably not | metadata->formats[identifiers] | ISBN13 | isbn | - | isbns |
| Title | atom:title | title | Title | productTitle | dc:title | title |
| Subtitle | usually part of dc:title | subtitle | Subtitle | - | included as part of dc:title | sub_title |
| Series | - (some vendors may include with dc:title) | series | - | series | - (sometimes "Part II" etc. in dc:title) | series |
| Author | atom:author, atom:creator | primaryCreator, metadata->creators | Authors (list--separated how?) | contributor (list--separated how?) | dc:creator, with a pgterms:agent entry for each | authors, additional_contributors |
| Description | atom:summary (text only) or atom:content (usually HTML, often has buy form/download link and other misc metadata) | metadata->shortDescription | Description (HTML format) | - | - | description |
| Publisher | dc:publisher | metadata->publisher | Publisher | publisher | dc:publisher (it's always "Project Gutenberg") | publishers |
| Imprint | - | metadata->imprint | - | imprint | n/a | - |
| Publication date | dc:issued (atom:published for when added to OPDS catalog) | metadata->publishDate, metadata->publishDateText | PubDate | publicationDate | dc:issued | publication_date |
| Language | dc:language | metadata->languages | Language | language | dc:language | primary_language, languages |
| Reader Rating | - | metadata->starRating, metadata->popularity | - (tracked, but not published through the API) | - | - | - |
| Classifications | atom:category | metadata->subjects, metadata->gradeLevels | - | subject, audience ("General Adult") | dc:subject (LCC=Library of Congress Classification, LCSH=Library of Congress Subject Headings) | suitabilities |
| Cover image | link with rel="http://opds-spec.org/thumbnail"/"http://opds-spec.org/cover"/"x-stanza-cover-image"/"x-stanza-cover-image-thumbnail" | images[thumbnail], metadata->images[cover] (contentreserve.com) | CoverLinkURL (very high quality) | - | - | - |
| Download link | opds:indirectAcquisition for DRM-encrypted stuff. Otherwise link rel="http://opds-spec.org/acquisition" or "http://opds-spec.org/acquisition/open-access" or "http://opds-spec.org/acquisition/borrow" | contentLink (using Download API) | NO WAY | downloadUrl from a 'checkout' call | The rdf:resource of a dc:hasFormat tag | n/a |
| Download link (sample) | link rel="http://opds-spec.org/acquisition/sample" | metadata->samples | - | - | n/a | n/a |
| Format | 'type' of acquisition link | metadata->formats | BookFormat | availability/availableFormats | dcam:hasFormat, each with a pgterms:file stanza | n/a |
| File size | - | metadata->formats[fileSize] | Size | fileSize (but seems to only exist for audio books) | pgterms:file/dc:extent | n/a |
| Purchased copies | - | availability->copiesOwned | TotalCopies | availability/totalCopies | n/a | "copies" API for physical copies |
| Available copies | - | availability->copiesAvailable | AvailableCopies | availability/availableCopies | n/a | "copies" API for physical copies |
| Size of hold queue | - | availability->numberOfHolds | OnHoldCount | availability/holdQueueSize | n/a | n/a |
| Other | atom:rights supersedes dc:rights. Library use case probably has to be worked out on a case-by-case basis. | metadata->awards, metadata->reviews, metadata->popularity, metadata->sortTitle | NumberOfPages (always seems to be missing or wrong), PhysicalISBN | annotation ("ENGLISH"), addedDate (i.e. date added to inventory), minLoanPeriod, maxLoanPeriod, availability/updateDate, availability/Checkouts, availability/Holds. Availability information per patron: holdQueuePosition, isInHoldQueue, isReserved, reservedEndDate, isCheckedout, checkoutFormat, checkoutStartDate, checkoutEndDate, downloadURl | dc:rights (some PG books are copyrighted and may have restrictions on distribution, but it's done on a per-book basis and spot checks turn up nothing special) | pages, edition, contents (table of contents) |
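
The comparison above can be turned into a simple lookup for normalizing records. This sketch covers only three flat fields for three sources, using the field names from the table. It assumes each record has already been parsed into a flat dict, which is a simplification (OverDrive's nested fields, for example, would need path handling).

```python
# Common term -> source-specific field name, taken from the comparison table.
FIELD_MAPS = {
    "3m": {"title": "Title", "isbn": "ISBN13", "publisher": "Publisher"},
    "axis360": {"title": "productTitle", "isbn": "isbn", "publisher": "publisher"},
    "bibliocommons": {"title": "title", "isbn": "isbns", "publisher": "publishers"},
}


def normalize(source, record):
    """Rename a source record's fields to the common terms.

    Values are passed through unchanged, so list-valued fields stay lists.
    Missing fields come back as None.
    """
    field_map = FIELD_MAPS[source]
    return {common: record.get(native) for common, native in field_map.items()}


# Hypothetical records; the values are placeholders.
print(normalize("3m", {"Title": "Example Title", "ISBN13": "9780000000002"}))
print(normalize("axis360", {"productTitle": "Example Title", "publisher": "Example House"}))
```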