BibliographicMetadata

Types of metadata

Information about the book ("objective")

This category includes most of the common things you might want to know about a book:

Title
Description
Authors
Identifiers, such as ISBN
Which work it's an edition of
Which series the book is in, and where it fits in that series.
Format
Publisher
Language
Classifications
Excerpts
Cover art

It also includes information about the content of the book:

Table of contents
Bibliography
Characters in a work of fiction
Real-life events and people mentioned in the book
Other books mentioned in the book's bibliography

"Objective" is in quotes because some things like classification and description can get a little subjective.

Information about readers' interactions with the book ("subjective")

Ratings
The text of reviews (reader reviews and professional reviews)
Popularity rankings (current and historical)
Awards

"Subjective" is in quotes because some things like the popularity of a book can be measured objectively.

Facts about our inventory

It's essential to get this information from paid ebook providers. Ideally we would have all this information, but some of it isn't currently available:

How many licenses we own.
How many loans are left on each of those licenses.
How many licenses are checked out, and when each checkout happened.
How many licenses are reserved for one particular patron, and when each reservation happened.
How many people are in the hold queue, and when they joined the hold queue.
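
The inventory facts listed above could be held in a small data structure like the one below. This is only a sketch: the class and field names (`License`, `Hold`, `Inventory`, `loans_remaining`, and so on) are made up for illustration and do not correspond to any provider's API, and the sample values are invented.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class License:
    # None means the provider did not report the value.
    loans_remaining: Optional[int] = None
    checked_out_at: Optional[datetime] = None  # set while the license is on loan
    reserved_for: Optional[str] = None         # patron ID, if reserved
    reserved_at: Optional[datetime] = None


@dataclass
class Hold:
    patron_id: str
    joined_at: datetime


@dataclass
class Inventory:
    licenses: List[License] = field(default_factory=list)
    hold_queue: List[Hold] = field(default_factory=list)

    def owned(self) -> int:
        return len(self.licenses)

    def checked_out(self) -> int:
        return sum(1 for lic in self.licenses if lic.checked_out_at is not None)

    def reserved(self) -> int:
        return sum(1 for lic in self.licenses if lic.reserved_for is not None)

    def available(self) -> int:
        # A license is available when it is neither on loan nor reserved.
        return sum(
            1 for lic in self.licenses
            if lic.checked_out_at is None and lic.reserved_for is None
        )


# Invented sample data: three licenses, one hold.
inventory = Inventory(
    licenses=[
        License(loans_remaining=25, checked_out_at=datetime(2014, 1, 5, 9, 30)),
        License(loans_remaining=26, reserved_for="patron-1",
                reserved_at=datetime(2014, 1, 6, 12, 0)),
        License(loans_remaining=26),
    ],
    hold_queue=[Hold("patron-2", datetime(2014, 1, 6, 13, 0))],
)
print(inventory.owned(), inventory.checked_out(),
      inventory.reserved(), inventory.available(), len(inventory.hold_queue))
```

Keeping the timestamps on each license and hold, rather than storing only counts, preserves the "when each checkout happened" information for providers that supply it.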

Metadata sources

Instead of the standardized vocabularies I'll cover later, most of these sources serve XML or JSON documents that use custom vocabularies. To figure out exactly what the XML tag names and JSON field names mean, you must consult some nearby piece of documentation.

NYPL MARC

Includes topics, genre, summary, and more.
A MARC XML record. Compare the same data in HTML.
Matt is converting our MARC catalog to JSON. We'll see what data we have available in there once he's done.

OCLC Classify

A free API that can turn title/author or ISBN into internal "work ids" or OCLC IDs. If we get internal "work ids", we can always turn them into a list of OCLC IDs. TODO: nail down the terms of use (noncommercial only?).
Author information is linked to VIAF (e.g.), which makes it possible to localize author names.

Worldcat Open Data

Significant amounts of bibliographic data are available in RDF format once you know the OCLC ID. The data is made available under an attribution-style license.

Example RDF
Author information is linked to VIAF (e.g.), which makes it possible to localize author names.

OCLC xID

Maps between OCLC number, ISBN, OCLC work ID (owi), and LCCN.

Documentation for lookup by OCLC number
Documentation for lookup by OCLC work id
Example
Terms of service: Non-commercial use that does not exceed 1,000 accesses per day. More access is allowed via subscription. This requires a WorldCat affiliate account.

This API is extremely prone to 500 errors. I don't think there's any reason to use it given the existence of WorldCat Open Data.

Syndetics

This is a pay service but we are already paying for it. I don't have login/password credentials, so I can't get documentation or support, but I do know our API credential.

We can look up significant amounts of metadata (review, summary, excerpt) given ISBN or OCLC number.

Someone else's copy of the documentation I can't find elsewhere

3M XML

A vocabulary for describing 3M Cloud Library's catalog. It looks to have been autogenerated from the names used in C# classes on 3M's side.

Includes minimal objective metadata
Includes no subjective metadata
Includes some inventory metadata

OverDrive JSON

OverDrive's Metadata API serves JSON documents. Its other APIs also use this vocabulary when describing books. Documents are served as a custom media type, application/vnd.overdrive.api+json, but I don't think the media type is formally defined anywhere.

Includes decent objective metadata
Supposedly includes subjective metadata, but it's pretty sparse
The availability API includes some inventory metadata.

Gutenberg Bookshelves

Project Gutenberg has a number of manually-created wiki bookshelves which divide books up by category. See for instance the Children's Bookshelf.

BiblioCommons

BiblioCommons's Titles API serves JSON documents that use custom terms like "authors", "isbns", and "title". Each term is explicitly defined in the API documentation.

Documents are served as application/json. The internal format of the JSON documents is not explicitly defined.

Includes minimal objective metadata
Includes no subjective metadata

Working with the BiblioCommons API

Anything that can be constructed as an Advanced Search can be structured as a parameterized title search.

As discussed in this [forum thread](http://developer.bibliocommons.com/forum/read/155663):

Queries from advanced search can be passed using search_type=custom. For example:
https://api.bibliocommons.com/v1/titles.json?library=gvpl&search_type=custom&q=anywhere%3A(moondog)%20%20formatcode%3A(BK%20OR%20EBOOK)&api_key={key}
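
A URL of this shape can be assembled programmatically. The sketch below uses only the parameters visible in the example URL above (`library`, `search_type`, `q`, `api_key`); the helper name `custom_search_url` is made up, and `"KEY"` is a placeholder rather than a real credential. It assumes the API accepts parentheses unescaped, as the example URL suggests.

```python
from urllib.parse import quote

BASE_URL = "https://api.bibliocommons.com/v1/titles.json"


def custom_search_url(library, query, api_key):
    """Build a titles search URL that passes an Advanced Search query."""
    params = [
        ("library", library),
        ("search_type", "custom"),
        ("q", query),
        ("api_key", api_key),
    ]
    # Keep parentheses literal, as in the example URL; encode spaces as %20
    # and colons as %3A.
    encoded = "&".join(
        f"{name}={quote(value, safe='()')}" for name, value in params
    )
    return f"{BASE_URL}?{encoded}"


url = custom_search_url(
    "gvpl", "anywhere:(moondog) formatcode:(BK OR EBOOK)", "KEY"
)
print(url)
```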

Harvard LibraryCloud

JSON or MARC format. Some data comes from OCLC but some (e.g. "ShelfRank", a rating of community engagement a.k.a. popularity) is original.

GoodReads

XML format.

Includes good objective metadata
Includes good subjective metadata, including reviews, but "harvesting and indexing" data is forbidden without permission.

Booki.sh

Owned by Zola. We already buy access to the Bookish recommendation API. They also have a basic metadata API. I'm not sure if we get access to the metadata API under the same terms as the recommendation API.

Coverage is about 1.5 million books. Metadata comes from publishers and distributors. Primary key is ISBN13.

Zola is working on an API to serve (truncated) reviews of books and to estimate star ratings. They are also working on a search API to map (e.g.) title/author to ISBN.

iDreamBooks

"Metacritic for books". They have a JSON-based API. We can get a list of recent critically acclaimed books and books that were recently featured on TV. For a given book we can get up to 5 critical reviews.

API TOS are oriented around displaying reviews directly to end-users.

You agree to not use the API to redistribute, harvest or index iDreamBooks' data without our explicit consent.
You agree to not truncate, modify or change our data.
You may not store our data except for caching purposes.

WikiData

[Example: "The Art of War"](http://www.wikidata.org/wiki/Q8251). Good source for data about books (mostly PD books). We always get author information and a link to the Wikipedia page. If we're lucky we might get related images, author images, or a summary. (The first paragraph of the Wikipedia page makes a decent summary in many cases.)
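
As a rough sketch of how the Wikipedia link might be pulled out of a WikiData record: WikiData serves entity JSON at `Special:EntityData/<id>.json`, with `labels` and `sitelinks` keyed under `entities`. The `SAMPLE` document below is a hand-written, heavily trimmed stand-in for a real response, not actual output, and the helper names are made up.

```python
import json
from urllib.parse import quote


def entity_data_url(qid):
    """URL of the JSON representation of a WikiData entity."""
    return f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"


# Hand-written stand-in for a response; real documents are much larger.
SAMPLE = json.loads("""
{"entities": {"Q8251": {
  "labels": {"en": {"language": "en", "value": "The Art of War"}},
  "sitelinks": {"enwiki": {"site": "enwiki", "title": "The Art of War"}}
}}}
""")


def summarize(document, qid, lang="en"):
    """Return the label and Wikipedia URL for an entity, if present."""
    entity = document["entities"][qid]
    label = entity.get("labels", {}).get(lang, {}).get("value")
    title = entity.get("sitelinks", {}).get(lang + "wiki", {}).get("title")
    wikipedia = None
    if title is not None:
        wikipedia = f"https://{lang}.wikipedia.org/wiki/" + quote(
            title.replace(" ", "_")
        )
    return {"label": label, "wikipedia": wikipedia}


summary = summarize(SAMPLE, "Q8251")
print(entity_data_url("Q8251"))
print(summary)
```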

Amazon Product Advertising API

XML format.

Includes good objective metadata.
Includes very good subjective metadata.

TOS forbids usage except to promote Amazon's products. I believe we have more favorable usage terms if we scrape the website as a spider rather than going through the API.

The SNAP dataset has Amazon user reviews up to March 2013. We might be able to use it--I don't know what the terms are.

NYT

Best Sellers API - Basic historical popularity information. Includes links to NYT reviews where appropriate.
Article Search API - Can be used to search for book reviews

USA Today

Book review API
Best-selling books API (apparently only for "personal, noncommercial use")
TOS: "Your use of the USAT Census API, the USAT Books/Music/Movie Reviews API, and the USAT Articles API is not limited to your personal, noncommercial use and you may use those specific USAT APIs for commercial purposes."

LibraryThing

Probably the most useful [set of bibliographic APIs](https://www.librarything.com/wiki/index.php/LibraryThing_APIs).

LibraryThing "What Work" looks up the LibraryThing "work" for an ISBN or title/author.
ThingTitle is similar but gives a little more information.
Once we have the work ID, we can tie this into [ck.getwork](http://www.librarything.com/services/rest/documentation/1.1/librarything.ck.getwork.php) to get access to Common Knowledge facts about the book.
Bunch of data feeds I haven't looked at.
Very good objective metadata
Very good subjective metadata, but reviews are behind a TOS-wall.

Content Cafe

Provides a variety of metadata. It's a paid service, but we're already paying for it.

Cover Art Sources

Book records from 3M and Overdrive include links to covers for the book.
We can get covers from LibraryThing for free, given an ISBN. We can only request one cover per second, and a max of 1,000 per day.
Open Library has bulk downloads of covers.
Syndetic Solutions
ChileFresh.com
Content Cafe
Recovering the Classics
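
The LibraryThing limits mentioned above (one cover per second, at most 1,000 per day) could be respected with a small client-side throttle. This is a minimal sketch: the class name `CoverThrottle` is made up, and treating "per day" as a 24-hour window starting at the first request is an assumption, not something LibraryThing documents here. The demo uses a fake clock so it runs instantly.

```python
import time


class CoverThrottle:
    """Allow one request per `min_interval` seconds and `daily_limit`
    requests per 24-hour window (the window definition is an assumption)."""

    def __init__(self, min_interval=1.0, daily_limit=1000,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.daily_limit = daily_limit
        self.clock = clock
        self.sleep = sleep
        self.window_start = None
        self.used = 0
        self.last_request = None

    def acquire(self):
        """Wait if needed; return False when the daily budget is spent."""
        now = self.clock()
        if self.window_start is None or now - self.window_start >= 86400:
            self.window_start = now
            self.used = 0
        if self.used >= self.daily_limit:
            return False
        if self.last_request is not None:
            wait = self.min_interval - (now - self.last_request)
            if wait > 0:
                self.sleep(wait)
                now = self.clock()
        self.last_request = now
        self.used += 1
        return True


# Demo with a fake clock and a tiny daily limit.
fake_now = [0.0]
throttle = CoverThrottle(
    daily_limit=3,
    clock=lambda: fake_now[0],
    sleep=lambda seconds: fake_now.__setitem__(0, fake_now[0] + seconds),
)
results = [throttle.acquire() for _ in range(4)]
print(results, fake_now[0])
```

In real use the default `time.monotonic` and `time.sleep` would apply, and the caller would fetch a cover only when `acquire()` returns `True`.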

Vocabularies

All of these metadata sources use different made-up vocabularies to talk about the same real-world objects (books).

Let's define a vocabulary as an agreed-upon set of real-world semantics for otherwise meaningless strings. On its own, the word "date" is ambiguous. It could refer to an object, an event, or a point in time. To a computer, "date" is just a four-character string. It doesn't mean anything at all. But in the Dublin Core vocabulary, "date" has a precise meaning: it always refers to a point in time.

Two systems can only communicate with each other if they share a vocabulary. Most vocabularies are ad hoc custom vocabularies designed for one specific system. For instance, the 3M API uses terms like "EventStartDateInUTC" whose meanings are defined only in the documentation for the 3M API. (And sometimes not even then--terms like "PhysicalISBN" are never formally defined.)

There are many, many vocabularies for describing books. It's probably the most common type of vocabulary, after vocabularies for describing people. For books, our problem is an abundance of standardized vocabularies on top of the abundance of ad hoc vocabularies.

Some vocabularies are tied to a format (ONIX, Atom). Some are tied not only to a format but to a specific piece of software (OverDrive, 3M). Some are format-neutral (Dublin Core, http://schema.org/Book). But each vocabulary has an ideology: it encapsulates the language used within a particular relationship.

For example, http://schema.org/Book encapsulates the language a webmaster uses when talking about a book to a search engine. ONIX encapsulates the language a publisher uses when talking about a book to a bookstore. They are both talking about books, but at different levels of detail and for different purposes.

When deciding which vocabulary or vocabularies to support we need to start with the question: who are we communicating with? What is our relationship with them?

Vocabularies that are tied to format

MARC

MARC is the LoC's famous heavyweight vocabulary for bibliographic information. There is a binary serialization and an XML serialization. Both use concise terms like "100" and "x" rather than human-readable terms like "author" and "subcategory"; these form MARC's "content designation".
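
To make the terse content designation concrete, here is a sketch that reads a few fields out of a MARC XML record with Python's standard library. The record is a hand-written stand-in with placeholder values, and the helper name `subfields` is made up. Field 100 is the main personal-name entry, 245 the title statement, and 650 a topical subject whose subfield "x" is a general subdivision.

```python
import xml.etree.ElementTree as ET

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Hand-written sample record with placeholder values.
RECORD = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Author, Example</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">An example title</subfield>
  </datafield>
  <datafield tag="650" ind1=" " ind2="0">
    <subfield code="a">Example topic</subfield>
    <subfield code="x">Example subdivision</subfield>
  </datafield>
</record>"""


def subfields(record, tag, code):
    """Return the text of every subfield `code` in datafields with `tag`."""
    path = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
    return [el.text for el in record.findall(path, MARC_NS)]


record = ET.fromstring(RECORD)
print("author:", subfields(record, "100", "a"))
print("title:", subfields(record, "245", "a"))
print("subdivision:", subfields(record, "650", "x"))
```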

Project Gutenberg serves MARC records for its books, generated from its RDF catalog.

The 3M API will serve an XML MARC record for purchased ebooks ("Get MARC"). I don't think Overdrive offers this feature.

ONIX

ONIX is "intended to support computer-to-computer communication between parties involved in creating, distributing, licensing or otherwise making available intellectual property in published form, whether physical or digital." It is a heavyweight format like MARC, but MARC is optimized for library catalogs and ONIX seems more general, intended for business-to-business use.

ONIX is an XML vocabulary. It seems to define both MARC-like abbreviated terms like "b203" and equivalent human-readable terms like "TitleText".

Atom

Atom is a media type based on XML ("application/atom+xml") originally designed for serializing blog posts. It defines basic, generic terms like "author", "category", "published", and "title".

Although spartan on its own, Atom is extensible through a "profile" mechanism and has been widely extended, notably by OPDS, which defines an Atom profile for catalogs of ebooks.
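
As an illustration of Atom's generic terms plus an OPDS extension, the sketch below parses a single entry with Python's standard library. The entry is hand-written with placeholder values and an example.com URL; only the element names and the OPDS acquisition link relation are taken from the vocabularies themselves.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
OPDS_ACQUISITION = "http://opds-spec.org/acquisition"

# Hand-written sample entry with placeholder values.
ENTRY = """<entry xmlns="http://www.w3.org/2005/Atom">
  <title>An Example Book</title>
  <author><name>A. N. Author</name></author>
  <category term="Fiction"/>
  <published>2014-01-01T00:00:00Z</published>
  <link rel="http://opds-spec.org/acquisition"
        type="application/epub+zip"
        href="http://example.com/book.epub"/>
</entry>"""

entry = ET.fromstring(ENTRY)
title = entry.findtext(ATOM + "title")
authors = [a.findtext(ATOM + "name") for a in entry.findall(ATOM + "author")]
categories = [c.get("term") for c in entry.findall(ATOM + "category")]
# OPDS marks download links with rel values under the acquisition prefix.
downloads = [
    (link.get("type"), link.get("href"))
    for link in entry.findall(ATOM + "link")
    if link.get("rel", "").startswith(OPDS_ACQUISITION)
]
print(title, authors, categories, downloads)
```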

Format-independent vocabularies

Most of these are RDF vocabularies, which can be serialized in a number of forms, used in Linked Data applications, and included in a number of other formats (because pretty much everything in an RDF vocabulary is a URI).

http://schema.org/Book

Schema.org vocabularies are RDF vocabularies that can also be expressed in HTML 5 microdata. Its "Book" vocabulary defines basic terms like "author", "illustrator", "isbn", "copyrightYear", and "genre".

Definition
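
For a sense of what these terms look like in use, here is a sketch that emits a Book description as JSON-LD. Note the swap: this uses JSON-LD, another serialization schema.org supports, rather than the HTML 5 microdata mentioned above. The helper name `book_jsonld` is made up and every value is a placeholder, including the ISBN.

```python
import json


def book_jsonld(title, author, isbn, genre=None, copyright_year=None):
    """Describe a book using schema.org terms, as a JSON-LD dictionary."""
    doc = {
        "@context": "http://schema.org",
        "@type": "Book",
        "name": title,
        "author": {"@type": "Person", "name": author},
        "isbn": isbn,
    }
    # Optional terms are left out entirely when unknown.
    if genre is not None:
        doc["genre"] = genre
    if copyright_year is not None:
        doc["copyrightYear"] = copyright_year
    return doc


doc = book_jsonld(
    "An Example Book", "A. N. Author", "9780000000002",
    genre="Fiction", copyright_year=2014,
)
print(json.dumps(doc, indent=2))
```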

Dublin Core

A basic "vocabulary of fifteen properties for use in resource description", first created in 1995. These include "subject", "creator", and "title". Since 1995 many additional terms have been added to Dublin Core, like "isReplacedBy" and "abstract".

Project Gutenberg uses Dublin Core's recommendation for presenting bibliographic information in RDF. This means a lot of DCMI Metadata Terms and limited use of the DCMI Abstract Model's "memberOf" term.

Miscellaneous

In addition to Dublin Core, Project Gutenberg's RDF documents also include one element from Creative Commons (for licensing), plus a custom Project Gutenberg vocabulary containing miscellaneous information about authors, ebook numbers, and download locations.

Side-by-side comparison

This table compares five major vocabularies for ebook sources, plus one source for catalog data (BiblioCommons). It focuses on the terms we care most about for identifying titles, tracking our inventory, and downloading books once they've been checked out. We can add to this table as we investigate more APIs.

Field	Atom+OPDS	Overdrive	3M	Axis 360	Gutenberg	BiblioCommons
Internal ID	atom:id or dc:identifier	id	ItemId	titleId	The URL that pgterms:ebook is rdf:about	id
Permalink	atom:id	links[self], links[availability], links[metadata]	BookLinkURL	titleUrl	Same as internal ID	details_url
ISBN-13	dc:identifier may be a urn:isbn: URI, but probably not	metadata->formats[identifiers]	ISBN13	isbn	-	isbns
Title	atom:title	title	Title	productTitle	dc:title	title
Subtitle	usually part of dc:title	subtitle	Subtitle	-	included as part of dc:title	sub_title
Series	- (some vendors may include with dc:title)	series	-	series	- (sometimes "Part II" etc. in dc:title)	series
Author	atom:author, atom:creator	primaryCreator, metadata->creators	Authors (list--separated how?)	contributor (list--separated how?)	dc:creator, with a pgterms:agent entry for each	authors, additional_contributors
Description	atom:summary (text only) or atom:content (usually HTML, often has buy form/download link and other misc metadata)	metadata->shortDescription	Description (HTML format)	-	-	description
Publisher	dc:publisher	metadata->publisher	Publisher	publisher	dc:publisher (it's always "Project Gutenberg")	publishers
Imprint	-	metadata->imprint	-	imprint	n/a	-
Publication date	dc:issued (atom:published for when added to OPDS catalog)	metadata->publishDate, metadata->publishDateText	PubDate	publicationDate	dc:issued	publication_date
Language	dc:language	metadata->languages	Language	language	dc:language	primary_language, languages
Reader Rating	-	metadata->starRating, metadata->popularity	- (tracked, but not published through the API)	-	-	-
Classifications	atom:category	metadata->subjects, metadata->gradeLevels	-	subject, audience ("General Adult")	dc:subject (LCC=Library of Congress Classification, LCSH=Library of Congress Subject Headings)	suitabilities
Cover image	link with rel="http://opds-spec.org/thumbnail"/"http://opds-spec.org/cover" /"x-stanza-cover-image"/"x-stanza-cover-image-thumbnail"	images[thumbnail], metadata->images[cover] (contentreserve.com)	CoverLinkURL (very high quality)	-	-	-
Download link	opds:indirectAcquisition for DRM-encrypted stuff. Otherwise link rel="http://opds-spec.org/acquisition" or "http://opds-spec.org/acquisition/open-access" or "http://opds-spec.org/acquisition/borrow"	contentLink (using Download API)	NO WAY	downloadUrl from a 'checkout' call	The rdf:resource of a dc:hasFormat tag	n/a
Download link (sample)	link rel="http://opds-spec.org/acquisition/sample"	metadata->samples	-	-	n/a	n/a
Format	'type' of acquisition link	metadata->formats	BookFormat	availability/availableFormats	dcam:hasFormat, each with a pgterms:file stanza	n/a
File size	-	metadata->formats[fileSize]	Size	fileSize (but seems to only exist for audio books)	pgterms:file/dc:extent	n/a
Purchased copies	-	availability->copiesOwned	TotalCopies	availability/totalCopies	n/a	"copies" API for physical copies
Available copies	-	availability->copiesAvailable	AvailableCopies	availability/availableCopies	n/a	"copies" API for physical copies
Size of hold queue	-	availability->numberOfHolds	OnHoldCount	availability/holdQueueSize	n/a	n/a
Other	atom:rights supersedes dc:rights. Library use case probably has to be worked out on a case-by-case basis.	metadata->awards, metadata->reviews, metadata->popularity, metadata->sortTitle	NumberOfPages (always seems to be missing or wrong), PhysicalISBN	annotation ("ENGLISH"), addedDate (i.e. date added to inventory), minLoanPeriod, maxLoanPeriod, availability/updateDate, availability/Checkouts, availability/Holds Availability information per patron: holdQueuePosition, isInHoldQueue, isReserved, reservedEndDate, isCheckedout, checkoutFormat, checkoutStartDate, checkoutEndDate, downloadURl	dc:rights (some PG books are copyrighted and may have restrictions on distribution, but it's done on a per-book basis and spot checks turn up nothing special)	pages, edition, contents (table of contents)
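
One way to use a table like this is as a lookup that translates each vendor's terms into a single internal vocabulary. The sketch below covers only the title and inventory rows for OverDrive, 3M, and Axis 360. It assumes each vendor's response has already been parsed into nested dictionaries (3M's XML included) and that OverDrive's title and availability data have been merged into one record. The internal names (`copies_owned` and so on) and the sample records are invented.

```python
# Internal field name -> path into each vendor's (already parsed) record.
FIELD_PATHS = {
    "overdrive": {
        "title": ("title",),
        "copies_owned": ("availability", "copiesOwned"),
        "copies_available": ("availability", "copiesAvailable"),
        "holds": ("availability", "numberOfHolds"),
    },
    "3m": {
        "title": ("Title",),
        "copies_owned": ("TotalCopies",),
        "copies_available": ("AvailableCopies",),
        "holds": ("OnHoldCount",),
    },
    "axis360": {
        "title": ("productTitle",),
        "copies_owned": ("availability", "totalCopies"),
        "copies_available": ("availability", "availableCopies"),
        "holds": ("availability", "holdQueueSize"),
    },
}


def lookup(record, path):
    """Follow `path` through nested dicts; return None if any key is missing."""
    current = record
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def normalize(source, record):
    """Translate one vendor record into the internal vocabulary."""
    return {
        name: lookup(record, path)
        for name, path in FIELD_PATHS[source].items()
    }


# Invented sample records.
overdrive = {"title": "An Example Book",
             "availability": {"copiesOwned": 5, "copiesAvailable": 2,
                              "numberOfHolds": 0}}
three_m = {"Title": "An Example Book", "TotalCopies": 3,
           "AvailableCopies": 0, "OnHoldCount": 4}

print(normalize("overdrive", overdrive))
print(normalize("3m", three_m))
```

Missing fields come back as `None` rather than raising, which matches the many "-" cells in the table.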