BibliographicMetadata
This category includes most of the common things you might want to know about a book:
- Title
- Description
- Authors
- Identifiers, such as ISBN
- Which work it's an edition of
- Which series the book is in, and where it fits in that series.
- Format
- Publisher
- Language
- Classifications
- Excerpts
- Cover art
It also includes information about the content of the book:
- Table of contents
- Bibliography
- Characters in a work of fiction
- Real-life events and people mentioned in the book
- Other books mentioned in the book's bibliography
"Objective" is in quotes because some things like classification and description can get a little subjective.
- Ratings
- The text of reviews (reader reviews and professional reviews)
- Popularity rankings (current and historical)
- Awards
"Subjective" is in quotes because some things, like the popularity of a book, can be measured objectively.
It's essential to get this information from paid ebook providers. Ideally we would have all this information, but some of it isn't currently available:
- How many licenses we own.
- How many loans are left on each of those licenses.
- How many licenses are checked out, and when each checkout happened.
- How many licenses are reserved for one particular patron, and when each reservation happened.
- How many people are in the hold queue, and when they joined the hold queue.
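As a rough sketch, the inventory facts listed above could be held in a structure like the following. The class and field names (`License`, `TitleInventory`, `loans_remaining`, and so on) are hypothetical and do not come from any provider's API; they only illustrate what we would want to track per title.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class License:
    """One purchased license for a title (hypothetical structure)."""
    loans_remaining: Optional[int] = None      # None = unlimited or unknown
    checked_out_at: Optional[datetime] = None  # set while the license is on loan
    reserved_at: Optional[datetime] = None     # set while held for one patron


@dataclass
class TitleInventory:
    """Inventory metadata for a single title (hypothetical structure)."""
    licenses: List[License] = field(default_factory=list)
    # One timestamp per patron in the hold queue: when they joined.
    hold_queue: List[datetime] = field(default_factory=list)

    @property
    def licenses_owned(self) -> int:
        return len(self.licenses)

    @property
    def licenses_checked_out(self) -> int:
        return sum(1 for lic in self.licenses if lic.checked_out_at is not None)

    @property
    def licenses_reserved(self) -> int:
        return sum(1 for lic in self.licenses if lic.reserved_at is not None)

    @property
    def licenses_available(self) -> int:
        # Available = neither checked out nor reserved for a patron.
        return sum(
            1 for lic in self.licenses
            if lic.checked_out_at is None and lic.reserved_at is None
        )
```

Where a provider does not expose one of these facts, the corresponding field would simply stay empty.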
Instead of the standardized vocabularies I'll cover later, most of these sources serve XML or JSON documents that use custom vocabularies. To figure out exactly what the XML tag names and JSON field names mean, you must consult some nearby piece of documentation.
- Includes topics, genre, summary, and more.
- A MARC XML record. Compare the same data in HTML.
- Matt is converting our MARC catalog to JSON. We'll see what data we have available in there once he's done.
- A free API that can turn title/author or ISBN into internal "work ids" or OCLC IDs. If we get internal "work ids", we can always turn them into a list of OCLC IDs. TODO: nail down terms of use. noncommercial?
Significant amounts of bibliographic data are available in RDF format once you know the OCLC ID. Data is made available under an attribution-style license.
- Example RDF
- Author information is linked to VIAF (e.g.), which makes it possible to localize author names.
Maps between OCLC number, ISBN, OCLC work ID (owi), and LCCN.
- Documentation for lookup by OCLC number
- Documentation for lookup by OCLC work id
- Example
- Terms of service Non-commercial use that does not exceed 1k accesses per day. More access is allowed via subscription. This requires a WorldCat affiliate account.
This API is extremely prone to 500 errors. I don't think there's any reason to use it given the existence of WorldCat Open Data.
This is a pay service but we are already paying for it. I don't have login/password credentials, so I can't get documentation or support, but I do know our API credential.
We can look up significant amounts of metadata (review, summary, excerpt) given ISBN or OCLC number.
A vocabulary for describing 3M Cloud Library's catalog. It looks to have been autogenerated from the names used in C# classes on 3M's side.
- Includes minimal objective metadata
- Includes no subjective metadata
- Includes some inventory metadata
Overdrive's Metadata API serves JSON documents. Its other APIs also use this vocabulary when describing books. Documents are served as a custom media type, application/vnd.overdrive.api+json, but I don't think the media type is formally defined anywhere.
- Includes decent objective metadata
- Supposedly includes subjective metadata, but it's pretty sparse
- The availability API includes some inventory metadata.
Project Gutenberg has a number of manually-created wiki bookshelves which divide books up by category. See for instance the Children's Bookshelf.
Bibliocommons's Titles API serves JSON documents that use custom terms like "authors", "isbns", and "title". Each term is explicitly defined in the API documentation.
Documents are served as application/json. The internal format of the JSON documents is not explicitly defined.
- Includes minimal objective metadata
- Includes no subjective metadata
Anything that can be constructed as an Advanced Search can be structured for a parameterized title search.
As discussed in a [forum thread](http://developer.bibliocommons.com/forum/read/155663):
Queries from advanced search can be passed using search_type=custom. For example:
JSON or MARC format. Some data comes from OCLC but some (e.g. "ShelfRank", a rating of community engagement a.k.a. popularity) is original.
XML format.
- Includes good objective metadata
- Includes good subjective metadata, including reviews, but "harvesting and indexing" data is forbidden without permission.
Owned by Zola. We already buy access to the Bookish recommendation API. They also have a basic metadata API. I'm not sure if we get access to the metadata API under the same terms as the recommendation API.
Coverage is about 1.5 million books. Metadata comes from publishers and distributors. Primary key is ISBN13.
Zola is working on an API to provide (truncated) reviews of books and to estimate star ratings. They are also working on a search API to map (e.g.) title/author to ISBN.
"Metacritic for books". They have a JSON-based API. We can get a list of recent critically acclaimed books and books that were recently featured on TV. For a given book we can get up to 5 critical reviews.
API TOS are oriented around displaying reviews directly to end-users.
- You agree to not use the API to redistribute, harvest or index iDreamBooks' data without our explicit consent.
- You agree to not truncate, modify or change our data.
- You may not store our data except for caching purposes.
- [Example: "The Art of War"](http://www.wikidata.org/wiki/Q8251) Good source for data about books (mostly PD books). We always get author information and a link to the Wikipedia page. If we're lucky we might get related images, author images, or a summary. (The first paragraph of the Wikipedia page makes a decent summary in many cases.)
XML format.
- Includes good objective metadata.
- Includes very good subjective metadata.
TOS forbids usage except to promote Amazon's products. I believe we have more favorable usage terms if we scrape the website as a spider rather than going through the API.
The SNAP dataset has Amazon user reviews up to March 2013. We might be able to use it--I don't know what the terms are.
- Best Sellers API - Basic historical popularity information. Includes links to NYT reviews where appropriate.
- Article Search API - Can be used to search for book reviews
- Book review API
- Best-selling books API (apparently only for "personal, noncommercial use")
- TOS "Your use of the USAT Census API, the USAT Books/Music/Movie Reviews API, and the USAT Articles API is not limited to your personal, noncommercial use and you may use those specific USAT APIs for commercial purposes."
Probably the most useful [set of bibliographic APIs](https://www.librarything.com/wiki/index.php/LibraryThing_APIs).
- LibraryThing "What Work" looks up the LibraryThing "work" for an ISBN or title/author.
- ThingTitle is similar but gives a little more information.
- Once we have the work ID, we can tie this into [ck.getwork](http://www.librarything.com/services/rest/documentation/1.1/librarything.ck.getwork.php) to get access to Common Knowledge facts about the book.
- Very good objective metadata
- Very good subjective metadata, but reviews are behind a TOS-wall.
Provides a variety of metadata. It's a paid service, but we're already paying for it.
- Book records from 3M and Overdrive include links to covers for the book.
- We can get covers from LibraryThing for free, given an ISBN. We can only request one cover per second, and a max of 1,000 per day.
- Open Library has bulk downloads of covers.
- Syndetic Solutions
- ChiliFresh.com
- Content Cafe
- Recovering the Classics
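The LibraryThing cover limits mentioned above (one request per second, at most 1,000 per day) could be respected with a small throttle like the one sketched here. The class name `CoverRequestThrottle` is hypothetical, and treating a "day" as a rolling 24-hour window starting at the first request is an assumption; how LibraryThing actually counts the daily quota is not known here.

```python
import time


class CoverRequestThrottle:
    """Enforce a minimum interval between requests and a daily cap."""

    def __init__(self, min_interval=1.0, daily_limit=1000,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.daily_limit = daily_limit
        self._clock = clock      # injectable so the throttle can be tested
        self._sleep = sleep
        self._last_request = None
        self._day_start = None
        self._count_today = 0

    def acquire(self):
        """Wait until a request is allowed; return False if the daily cap is hit."""
        now = self._clock()
        # Assumption: a "day" is a 24-hour window starting at first use.
        if self._day_start is None or now - self._day_start >= 86400:
            self._day_start = now
            self._count_today = 0
        if self._count_today >= self.daily_limit:
            return False
        if self._last_request is not None:
            wait = self.min_interval - (now - self._last_request)
            if wait > 0:
                self._sleep(wait)
                now = self._clock()
        self._last_request = now
        self._count_today += 1
        return True
```

A caller would check `throttle.acquire()` before each cover request and stop fetching for the day once it returns `False`.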
All of these metadata sources use different made-up vocabularies to talk about the same real-world objects (books).
Let's define a vocabulary as an agreed-upon set of real-world semantics for otherwise meaningless strings. On its own, the word "date" is ambiguous. It could refer to an object, an event, or a point in time. To a computer, "date" is just a four-character string. It doesn't mean anything at all. But in the Dublin Core vocabulary, "date" has a precise meaning: it always refers to a point in time.
Two systems can only communicate with each other if they share a vocabulary. Most vocabularies are ad hoc custom vocabularies designed for one specific system. For instance, the 3M API uses terms like "EventStartDateInUTC" whose meanings are defined only in the documentation for the 3M API. (And sometimes not even then--terms like "PhysicalISBN" are never formally defined.)
There are many, many vocabularies for describing books. It's probably the most common type of vocabulary, after vocabularies for describing people. For books, our problem is an abundance of standardized vocabularies on top of the abundance of ad hoc vocabularies.
Some vocabularies are tied to a format (ONIX, Atom). Some are tied not only to a format but to a specific piece of software (OverDrive, 3M). Some are format-neutral (Dublin Core, http://schema.org/Book). But each vocabulary has an ideology: it encapsulates the language used by a particular relationship.
For example, http://schema.org/Book encapsulates the language a webmaster uses when talking about a book to a search engine. ONIX encapsulates the language a publisher uses when talking about a book to a bookstore. They are both talking about books, but at different levels of detail and for different purposes.
When deciding which vocabulary or vocabularies to support we need to start with the question: who are we communicating with? What is our relationship with them?
MARC is the LoC's famous heavyweight vocabulary for bibliographic information. There is a binary serialization and an XML serialization. Both use concise terms like "100" and "x" rather than human-readable terms like "author" and "subcategory"; these form MARC's "content designation".
Project Gutenberg serves MARC records for its books, generated from its RDF catalog.
The 3M API will serve an XML MARC record for purchased ebooks ("Get MARC"). I don't think Overdrive offers this feature.
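To make MARC's content designation concrete, here is a minimal sketch that reads a MARC XML record and translates a couple of tag/subfield pairs into human-readable names. The sample record is made up for illustration, and the `READABLE` lookup is deliberately tiny; a real mapping would cover far more of MARC.

```python
import xml.etree.ElementTree as ET

# Namespace used by MARC XML ("MARC21 slim") documents.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Made-up sample record, not taken from any real catalog.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Melville, Herman,</subfield>
    <subfield code="d">1819-1891.</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Moby Dick</subfield>
  </datafield>
</record>"""

# Tiny, incomplete lookup from (tag, subfield code) to a readable name.
READABLE = {("100", "a"): "author", ("245", "a"): "title"}


def readable_fields(marc_xml):
    """Return a dict of the fields in READABLE that appear in the record."""
    root = ET.fromstring(marc_xml)
    result = {}
    for datafield in root.findall("marc:datafield", MARC_NS):
        tag = datafield.get("tag")
        for subfield in datafield.findall("marc:subfield", MARC_NS):
            name = READABLE.get((tag, subfield.get("code")))
            if name:
                result[name] = subfield.text
    return result


print(readable_fields(SAMPLE))
```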
ONIX is "intended to support computer-to-computer communication between parties involved in creating, distributing, licensing or otherwise making available intellectual property in published form, whether physical or digital." It is a heavyweight format like MARC, but MARC is optimized for library catalogs and ONIX seems more general, intended for business-to-business use.
ONIX is an XML vocabulary. It seems to define both MARC-like abbreviated terms like "b203" and equivalent human-readable terms like "TitleText".
Atom is a media type based on XML ("application/atom+xml") originally designed for serializing blog posts. It defines basic, generic terms like "author", "category", "published", and "title".
Although spartan on its own, Atom is extensible through a "profile" mechanism, and has been widely extended, notably by OPDS, which defines an Atom profile for catalogs of ebooks.
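A minimal sketch of reading one Atom entry from an OPDS catalog follows. The sample entry and its URLs are invented for illustration; only the Atom namespace and the `http://opds-spec.org/acquisition` link relations are real. The function name `summarize_entry` is hypothetical.

```python
import xml.etree.ElementTree as ET

ATOM = {"atom": "http://www.w3.org/2005/Atom"}
# OPDS acquisition link relations all start with this prefix.
ACQUISITION_PREFIX = "http://opds-spec.org/acquisition"

# Invented sample entry; real catalogs carry more elements.
SAMPLE_ENTRY = """<entry xmlns="http://www.w3.org/2005/Atom">
  <id>urn:example:book:1</id>
  <title>An Example Book</title>
  <author><name>Example Author</name></author>
  <category term="Fiction"/>
  <link rel="http://opds-spec.org/acquisition/open-access"
        type="application/epub+zip"
        href="http://example.org/book.epub"/>
</entry>"""


def summarize_entry(entry_xml):
    """Pull the generic Atom terms and any OPDS acquisition links from an entry."""
    entry = ET.fromstring(entry_xml)
    links = [
        (link.get("rel"), link.get("type"), link.get("href"))
        for link in entry.findall("atom:link", ATOM)
        if (link.get("rel") or "").startswith(ACQUISITION_PREFIX)
    ]
    return {
        "id": entry.findtext("atom:id", namespaces=ATOM),
        "title": entry.findtext("atom:title", namespaces=ATOM),
        "authors": [
            author.findtext("atom:name", namespaces=ATOM)
            for author in entry.findall("atom:author", ATOM)
        ],
        "categories": [c.get("term") for c in entry.findall("atom:category", ATOM)],
        "acquisition_links": links,
    }


print(summarize_entry(SAMPLE_ENTRY))
```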
Most of these are RDF vocabularies, which can be serialized in a number of forms, used in Linked Data applications, and included in a number of other formats (because pretty much everything in an RDF vocabulary is a URI).
Schema.org vocabularies are RDF vocabularies that can also be expressed in HTML 5 microdata. Its "Book" vocabulary defines basic terms like "author", "illustrator", "isbn", "copyrightYear", and "genre".
A basic "vocabulary of fifteen properties for use in resource description", first created in 1995. These include "subject", "creator", and "title". Since 1995 many additional terms have been added to Dublin Core, like "isReplacedBy" and "abstract".
Project Gutenberg uses Dublin Core's recommendation for presenting bibliographic information in RDF. This means a lot of DCMI Metadata Terms and limited use of the DCMI Abstract Model's "memberOf" term.
In addition to Dublin Core, Project Gutenberg's RDF documents also include one element from Creative Commons (for licensing), plus a custom Project Gutenberg vocabulary containing miscellaneous information about authors, ebook numbers, and download locations.
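As an illustration of mixing Dublin Core terms with the custom Project Gutenberg vocabulary, the sketch below reads a heavily simplified, made-up RDF/XML document. The `dc` prefix is bound here to the DCMI Metadata Terms namespace to match the prefix used on this page; the real Gutenberg files may bind prefixes differently and nest values more deeply, so treat the document shape as an assumption.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/terms/",
    "pgterms": "http://www.gutenberg.org/2009/pgterms/",
}

# Simplified, invented sample; not a real Project Gutenberg record.
SAMPLE_RDF = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/terms/"
    xmlns:pgterms="http://www.gutenberg.org/2009/pgterms/">
  <pgterms:ebook rdf:about="ebooks/0000">
    <dc:title>An Example Title</dc:title>
    <dc:publisher>Project Gutenberg</dc:publisher>
    <dc:issued>2000-01-01</dc:issued>
  </pgterms:ebook>
</rdf:RDF>"""


def ebook_summary(rdf_xml):
    """Return the internal ID (rdf:about) and a few Dublin Core terms."""
    root = ET.fromstring(rdf_xml)
    ebook = root.find("pgterms:ebook", NS)
    return {
        # ElementTree spells namespaced attributes as {namespace}name.
        "id": ebook.get("{%s}about" % NS["rdf"]),
        "title": ebook.findtext("dc:title", namespaces=NS),
        "publisher": ebook.findtext("dc:publisher", namespaces=NS),
        "issued": ebook.findtext("dc:issued", namespaces=NS),
    }


print(ebook_summary(SAMPLE_RDF))
```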
This table compares five major vocabularies for ebook sources, plus one source for catalog data (BiblioCommons). It focuses on the terms we care most about for identifying titles, tracking our inventory, and downloading books once they've been checked out. We can add to this table as we investigate more APIs.
Field | Atom+OPDS | Overdrive | 3M | Axis 360 | Gutenberg | BiblioCommons |
---|---|---|---|---|---|---|
Internal ID | atom:id or dc:identifier | id | ItemId | titleId | The URL that pgterms:ebook is rdf:about | id |
Permalink | atom:id | links[self], links[availability], links[metadata] | BookLinkURL | titleUrl | Same as internal ID | details_url |
ISBN-13 | dc:identifier may be a urn:isbn: URI, but probably not | metadata->formats[identifiers] | ISBN13 | isbn | - | isbns |
Title | atom:title | title | Title | productTitle | dc:title | title |
Subtitle | usually part of dc:title | subtitle | Subtitle | - | included as part of dc:title | sub_title |
Series | - (some vendors may include with dc:title) | series | - | series | - (sometimes "Part II" etc. in dc:title) | series |
Author | atom:author, atom:creator | primaryCreator, metadata->creators | Authors (list--separated how?) | contributor (list--separated how?) | dc:creator, with a pgterms:agent entry for each | authors, additional_contributors |
Description | atom:summary (text only) or atom:content (usually HTML, often has buy form/download link and other misc metadata) | metadata->shortDescription | Description (HTML format) | - | - | description |
Publisher | dc:publisher | metadata->publisher | Publisher | publisher | dc:publisher (it's always "Project Gutenberg") | publishers |
Imprint | - | metadata->imprint | - | imprint | n/a | - |
Publication date | dc:issued (atom:published for when added to OPDS catalog) | metadata->publishDate, metadata->publishDateText | PubDate | publicationDate | dc:issued | publication_date |
Language | dc:language | metadata->languages | Language | language | dc:language | primary_language, languages |
Reader Rating | - | metadata->starRating, metadata->popularity | - (tracked, but not published through the API) | - | - | - |
Classifications | atom:category | metadata->subjects, metadata->gradeLevels | - | subject, audience ("General Adult") | dc:subject (LCC=Library of Congress Classification, LCSH=Library of Congress Subject Headings) | suitabilities |
Cover image | link with rel="http://opds-spec.org/thumbnail"/"http://opds-spec.org/cover" /"x-stanza-cover-image"/"x-stanza-cover-image-thumbnail" | images[thumbnail], metadata->images[cover] (contentreserve.com) | CoverLinkURL (very high quality) | - | - | - |
Download link | opds:indirectAcquisition for DRM-encrypted stuff. Otherwise link rel="http://opds-spec.org/acquisition" or "http://opds-spec.org/acquisition/open-access" or "http://opds-spec.org/acquisition/borrow" | contentLink (using Download API) | NO WAY | downloadUrl from a 'checkout' call | The rdf:resource of a dc:hasFormat tag | n/a |
Download link (sample) | link rel="http://opds-spec.org/acquisition/sample" | metadata->samples | - | - | n/a | n/a |
Format | 'type' of acquisition link | metadata->formats | BookFormat | availability/availableFormats | dc:hasFormat, each with a pgterms:file stanza | n/a |
File size | - | metadata->formats[fileSize] | Size | fileSize (but seems to only exist for audio books) | pgterms:file/dc:extent | n/a |
Purchased copies | - | availability->copiesOwned | TotalCopies | availability/totalCopies | n/a | "copies" API for physical copies |
Available copies | - | availability->copiesAvailable | AvailableCopies | availability/availableCopies | n/a | "copies" API for physical copies |
Size of hold queue | - | availability->numberOfHolds | OnHoldCount | availability/holdQueueSize | n/a | n/a |
Other | atom:rights supersedes dc:rights. Library use case probably has to be worked out on a case-by-case basis. | metadata->awards, metadata->reviews, metadata->popularity, metadata->sortTitle | NumberOfPages (always seems to be missing or wrong), PhysicalISBN | annotation ("ENGLISH"), addedDate (i.e. date added to inventory), minLoanPeriod, maxLoanPeriod, availability/updateDate, availability/Checkouts, availability/Holds. Availability information per patron: holdQueuePosition, isInHoldQueue, isReserved, reservedEndDate, isCheckedout, checkoutFormat, checkoutStartDate, checkoutEndDate, downloadURl | dc:rights (some PG books are copyrighted and may have restrictions on distribution, but it's done on a per-book basis and spot checks turn up nothing special) | pages, edition, contents (table of contents) |
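One way to put the table to work is a per-source lookup that renames vendor fields to a common set of names. The sketch below covers only a few flat fields taken from the table; the common names (`id`, `title`, `isbn13`, and so on) and the `normalize` function are hypothetical, and it assumes each record has already been parsed into a flat dictionary, which glosses over nested terms such as Overdrive's `metadata->publisher`.

```python
# Common field name -> vendor field name, per source (subset of the table).
FIELD_MAP = {
    "overdrive": {
        "id": "id", "title": "title", "subtitle": "subtitle", "series": "series",
    },
    "3m": {
        "id": "ItemId", "title": "Title", "subtitle": "Subtitle",
        "isbn13": "ISBN13", "publisher": "Publisher", "language": "Language",
    },
    "axis360": {
        "id": "titleId", "title": "productTitle", "isbn13": "isbn",
        "series": "series", "publisher": "publisher", "language": "language",
    },
    "bibliocommons": {
        "id": "id", "title": "title", "subtitle": "sub_title", "series": "series",
    },
}


def normalize(source, record):
    """Rename the vendor fields we know about; drop everything else."""
    mapping = FIELD_MAP[source]
    return {
        common: record[vendor]
        for common, vendor in mapping.items()
        if vendor in record
    }


# A 3M-style record (values invented) becomes a record with common names.
print(normalize("3m", {"ItemId": "abc", "Title": "An Example", "ISBN13": "9780000000002"}))
```

Keeping the mapping as data rather than code means new rows from the table can be added without touching `normalize`.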