Linked Archival Metadata: A Guidebook
version 0.99999
--
LiAM & Eric L. Morgan
http://sites.tufts.edu/liam/
April 23, 2014
Executive Summary
Linked data is a process for embedding the descriptive information of archives into the very fabric of the Web. By transforming archival description into linked data, an archivist enables other people as well as computers to read and use that archival description, even if those others are not a part of the archival community. The process goes both ways. Linked data also empowers archivists to use and incorporate the information of other linked data providers into their local description. This enables archivists to make their descriptions more thorough, more complete, and more value-added. For example, archival collections could be automatically supplemented with geographic coordinates for making maps, with images of people or additional biographical descriptions to make collections come alive, or with bibliographies for further reading.
Publishing and using linked data does not represent a change in the definition of archival description, but it does represent an evolution of how archival description is accomplished. For example, linked data is not about generating a document such as an EAD file. Instead it is about asserting sets of statements about an archival thing, and then allowing those statements to be brought together in any number of ways for any number of purposes. A finding aid is one such purpose. Indexing is another purpose. Use by a digital humanist is yet another purpose. While EAD files are encoded as XML documents and are therefore very computer readable, the reader must know the structure of EAD in order to make the most out of the data. EAD is archives-centric. The way data is manifested in linked data is domain-agnostic.
The objectives of archives include collection, organization, preservation, description, and often times access to unique materials. Linked data is about description and access. By taking advantage of linked data principles, archives will be able to improve their descriptions and increase access. This will require a shift in the way things get done but not what gets done. The goal remains the same.
Many tools already exist for transforming data in existing formats into linked data. This data can reside in Excel spreadsheets, database applications, MARC records, or EAD files. There are tiers of linked data publishing, so one does not have to do everything all at once. But to transform existing information or to maintain information over the long haul requires the skills of many people: archivists & content specialists, administrators & managers, metadata specialists & catalogers, computer programmers & systems administrators.
Moving forward with linked data is a lot like traveling to Rome. There are many ways to get there, and there are many things to do once you arrive, but the result will undoubtedly improve your ability to participate in the discussion of the human condition on a world wide scale.
Acknowledgements
The creation of this guidebook is really the effort of many people, not a single author. First and foremost, thanks goes to Anne Sauer and Eliot Wilczek. Anne saw the need, spearheaded the project, and made it a reality. Working side-by-side was Eliot who saw the guidebook to fruition. Then there is the team of people from the Working Group: Greg Colati, Karen Gracy, Corey Harper, Michelle Light, Susan Pyzynski, Aaron Rubinstein, Ed Summers, Kenneth Thibodeau, and Kathy Wisser. These people made themselves available for discussion and clarification. They provided balance between human archival practice and seemingly cold computer technology. Additional people offered their advice into the issues of linked data in archives including: Kevin Cawley, Diane Hillman, Mark Matienzo, Ross Singer, and Jane & Ade Stevenson. They filled in many gaps. Special thanks goes to my boss Tracey Bergstrom who more than graciously enabled some of this work to happen on company time. The questions raised by the anonymized library school student are definitely appreciated. The Code4Lib community was very helpful -- a sort of reality check. And then there are the countless other people who listened over and over again about what linked data is, is not, and what it can do. A big "thank you" goes to all.
Introduction
"Let's go to Rome!"
The purpose of this guidebook is to describe in detail what linked data is, why it is important, how you can publish it to the Web, how you can take advantage of your linked data, and how you can exploit the linked data of others. For the archivist, linked data is about universally making accessible and repurposing sets of facts about you and your collections. As you publish these facts you will be able to maintain a more flexible Web presence as well as a Web presence that is richer, more complete, and better integrated with complementary collections. The principles behind linked data are inextricably bound to the inner workings of the Web. It is a standard of practice that will last as long as the Web lasts, and that will be long into the foreseeable future. The process of publishing linked data in no way changes the what of archival practice, but it does shift how archival practice is accomplished. Linked data enables you to tell the stories of your collections, and it does so in a way that enables you to reach much wider and more global audiences.
Linked Archival Metadata: A Guidebook provides archivists with an overview of the current linked data landscape, defines basic concepts, identifies practical strategies for adoption, and emphasizes the tangible payoffs for archives implementing linked data. It focuses on clarifying why archives and archival users can benefit from linked data and identifies a graduated approach to applying linked data methods to archival description.
"Let's go to Rome!" This book uses the metaphor of a guidebook, specifically a guidebook describing a trip to Rome. Any trip abroad requires an understanding of the place where you intend to go, some planning, some budgeting, and some room for adventure. You will need to know how you are going to get there, where you are going to stay, what you are going to eat, what you want to see and do once you arrive, and what you might want to bring back as souvenirs. You behooves you to know a bit of the history, customs, and language. This is also true of your adventure to Linked Data Land. To that end, this guidebook is divided into the following sections:
* What is linked data and why should I care? - A description of
the beauty of Rome and why you should want to go there
* Linked data: A Primer - A history of Rome, an outline of its
culture, its language, its food, its customs, and what it is all
about today
* Strategies - Sets of things you ought to be aware of and plan
for before embarking on your adventure, and then itineraries
for immersing yourself and enjoying everything the Eternal
City has to offer once you have arrived
* Details - Lists of where to eat, what to do depending on your
particular interests, and how to get around the city in case you
need help
The Guidebook is a product of the Linked Archival Metadata planning project (LiAM), led by the Digital Collections and Archives at Tufts University and funded by the Institute of Museum and Library Services (IMLS). LiAM’s goals include defining use cases for linked data in archives and providing a roadmap to describe options for archivists intending to share their description using linked data techniques.
What is linked data, and why should I care?
"Tell me about Rome. Why should I go there?"
Linked data is a standardized process for sharing and using information on the World Wide Web. Since the process of linked data is woven into the very fabric of the way the Web operates, it is standardized and will be applicable as long as the Web is applicable. The process of linked data is domain agnostic, meaning its scope is equally apropos to archives, businesses, governments, etc. Everybody can participate, and everybody is equally invited to do so. Linked data is application independent. As long as your computer is on the Internet and knows about the World Wide Web, then it can take advantage of linked data.
Linked data is about sharing and using information (not mere data but data put into context). This information takes the form of simple "sentences" which are intended to be literally linked together to communicate knowledge. The form of linked data is similar to the forms of human language, and like human languages, linked data is expressive, nuanced, dynamic, and exact all at once. Because of its atomistic nature, linked data simultaneously simplifies and transcends previous information containers. It reduces the need for profession-specific data structures, but at the same time it does not negate their utility. This makes it easy for you to give your information away, and for you to use other people's information.
The benefits of linked data boil down to two things: 1) it makes information more accessible to both people as well as computers, and 2) it opens the doors to the creation of any number of knowledge services limited only by the power of human imagination. Because it is standardized, agnostic, and independent, and because it mimics human expression, linked data is more universal than many of the current processes of information dissemination. Universality implies decentralization, and decentralization promotes dissemination. On the Internet anybody can say anything at anytime. In the aggregate, this is a good thing and it enables information to be combined in ways yet to be imagined. Publishing information as linked data enables you to seamlessly enhance your own knowledge services as well as simultaneously enhance the knowledge of others.
"Rome is the Eternal City. After visting Rome you will be
better equipped to participate in the global conversation of
the human condition."
Linked data: A Primer
"Okay. My interest is piqued. Please tell me more about Rome's
history, culture, and geography. What are the people like, what
do they do for a living, and how can I get around if I'm going to
stay for a while?".
This section describes linked data in much greater detail. Specifically, this section introduces the reader to the history of linked data, a data model called RDF, RDF "serializations", RDF publishing models, ways RDF has been used, SPARQL,...
History
"Rome rose, fell, and has risen again."
The history of linked data begins with a canonical Scientific American article, "The Semantic Web" by Tim Berners-Lee, James Hendler, and Ora Lassila in 2001. [1] The article described an environment where Internet-wide information was freely available for both people and computers with the ultimate purpose of bringing new knowledge to light. To do this people were expected to: 1) employ a data model called RDF for organizing their information, and 2) express the RDF as XML files on the Web. In a time when the predominant mode of data modeling was rooted in relational databases, the idea of RDF was difficult for many people to understand. Of the people who did understand, many of them thought expressing RDF as XML made things too difficult to read. Despite these limitations, there was a flurry of academic research done around the idea of the Semantic Web, but the term "linked data" had yet to be coined.
Around this same time REST-ful computing was being articulated by the Internet community. Simply put, REST-ful computing is a way for computers to request and share information over the World Wide Web. All one usually had to do was submit a very long URL (complete with numerous name/value pairs) to a Web server, and the Web server was then expected to return computer readable data. Many computer programmers and people who could write HTML picked up on the idea quickly. REST-ful computing was seen as immediately practical with little need to learn anything about "data models". Because the ideas behind the Semantic Web may have been oversold and because REST-ful computing was seen as so easy to implement, REST-ful computing flourished (and continues to flourish) while interest in the Semantic Web waned.
Then, in 2006, Tim Berners-Lee concretely described how to make the Semantic Web a reality. [2] In that document he listed a four-step process for making linked data freely available on the Web. It is a practical process many people can identify with. The description also advocated for simpler URLs (URLs sans long name/value pairs) for identifying anything in the world -- people, places, or things both real and imaginary. At this same time new ways of expressing RDF were articulated and becoming popular. RDF manifested as XML was no longer the only choice. Also at this same time a few entrepreneurial individuals were beginning to provide software applications and services for creating, maintaining, and distributing RDF. This made the work of some people easier. An increasing number of specialized communities -- governments, scientific communities, professional associations, and "cultural heritage institutions" -- began making their data and metadata freely available on the Web, accessible via REST-ful computing or RDF. These developments, plus a kinship of "all things open" (open source software, open access publishing, open data, etc.) to the ideals of the Semantic Web, probably contributed to the current interest in the newly coined phrase "linked data". Progress and developments in linked data (now rarely called the Semantic Web) continue but at a more measured pace. Linked data communities are strengthening. The ideas behind modeling data as RDF are becoming better understood. Production-level RDF services are being implemented. While the ideas behind RDF, linked data, and ultimately the Semantic Web have yet to become mainstream, interest is now in a waxing phase.
What is RDF?
"Rome is inextricably linked to its Roman heritage."
Linked Data is a process for sharing human knowledge on the World Wide Web. It is about asserting relationships between things and then linking these things together to express knowledge. These two things (asserting relationships and linking) are at the very heart of linked data, and they are the defining characteristics of RDF. This section describes RDF in greater detail.
RDF is an acronym for Resource Description Framework. As the name implies, it is a structure (framework) for asserting relationships (descriptions) between things (resources). It is a model for organizing data. Unlike the data model of spreadsheets made up of rows & columns, or the data model of joined tables as in relational databases, the data model of RDF is based on the idea of a triple -- a simple "sentence" with three distinct parts: 1) a subject, 2) a predicate, and 3) an object. The subject of each triple is expected to be a URI (for the time being, think "URL"), and this URI is expected to point to things either real or imaginary. Similarly, the object of each triple is a URI, but it can also be a literal -- meaning a word, phrase, narrative, or number. Predicates take the form of URIs too, and they are intended to denote relationships between the subjects and objects. To extend the analogy of the sentence further, think of subjects and objects as if they were nouns, and think of predicates as if they were verbs.
RDF statements are often illustrated as arced graphs where subjects and objects are nodes in the illustration and predicates are lines connecting the nodes:
[ subject ] --- predicate ---> [ object ]
The "linking" in linked data happens when sets of RDF statements share common URIs. By doing so, the subjects of RDF statements end up having many characteristics, and the objects of URIs point to other subjects in other RDF statements. This linking process transforms independent sets of RDF statements into a web of interconnections, and this is where the Semantic Web gets its name:
/ --- a predicate ---------> [ an object ]
[ subject ] - | --- another predicate ---> [ another object ]
\ --- a third predicate ---> [ a third object ]
|
|
yet another predicate
|
|
\ /
[ yet another object ]
An example is in order. Suppose there is a thing called Rome, and it will be represented with the following URI: http://example.org/rome. We can now begin to describe Rome using triples:
subjects predicates objects
----------------------- ----------------- -------------------------
http://example.org/rome has name "Rome"
http://example.org/rome has founding date "1000 BC"
http://example.org/rome has description "A long long time ago,..."
http://example.org/rome is a type of http://example.org/city
http://example.org/rome is a sub-part of http://example.org/italy
The corresponding arced graph would look like this:
/ --- has name ------------> [ "Rome" ]
| --- has description -----> [ "A long time ago..." ]
[ http://example.org/rome ] - | --- has founding date ---> [ "1000 BC" ]
| --- is a sub-part of ---> [ http://example.org/italy ]
\ --- is a type of --------> [ http://example.org/city ]
In turn, the URI http://example.org/italy might have a number of relationships asserted against it also:
subjects predicates objects
------------------------ ----------------- -------------------------
http://example.org/italy has name "Italy"
http://example.org/italy has founding date "1923 AD"
http://example.org/italy is a type of http://example.org/country
http://example.org/italy is a sub-part of http://example.org/europe
Now suppose there were things called Paris, London, and New York. They can be represented in RDF as well:
subjects predicates objects
-------------------------- ----------------- -------------------------
http://example.org/paris has name "Paris"
http://example.org/paris has founding date "100 BC"
http://example.org/paris has description "You see, there's this tower..."
http://example.org/paris is a type of http://example.org/city
http://example.org/paris is a sub-part of http://example.org/france
http://example.org/london has name "London"
http://example.org/london has description "They drink warm beer here."
http://example.org/london has founding date "100 BC"
http://example.org/london is a type of http://example.org/city
http://example.org/london is a sub-part of http://example.org/england
http://example.org/newyork has founding date "1640 AD"
http://example.org/newyork has name "New York"
http://example.org/newyork has description "It is a place that never sleeps."
http://example.org/newyork is a type of http://example.org/city
http://example.org/newyork is a sub-part of http://example.org/unitedstates
Furthermore, each of "countries" can be have relationships denoted against them:
subjects predicates objects
------------------------------- ----------------- -------------------------
http://example.org/unitedstates has name "United States"
http://example.org/unitedstates has founding date "1776 AD"
http://example.org/unitedstates is a type of http://example.org/country
http://example.org/unitedstates is a sub-part of http://example.org/northamerica
http://example.org/england has name "England"
http://example.org/england has founding date "1066 AD"
http://example.org/england is a type of http://example.org/country
http://example.org/england is a sub-part of http://example.org/europe
http://example.org/france has name "France"
http://example.org/france has founding date "900 AD"
http://example.org/france is a type of http://example.org/country
http://example.org/france is a sub-part of http://example.org/europe
The resulting arced graph of all these triples might look like this:
[INSERT ARCED GRAPH HERE.]
From this graph, new information can be inferred as long as one is able to trace connections from one node to another node through one or more arcs. For example, using the arced graph above, questions such as the following can be asked and answered:
* What things are denoted as types of cities, and what are their names?
* What is the oldest city?
* What cities were founded after the year 1 AD?
* What countries are sub-parts of Europe?
* How would you describe Rome?
In summary, RDF is a data model -- a method for organizing discrete facts into a coherent information system. The model is built on the idea of triples whose parts are URIs or literals. Through the liberal reuse of URIs in and between sets of triples, questions surrounding the information can be answered and new information can be inferred. RDF is the what of linked data. Everything else (ontologies & vocabularies, URIs, RDF "serializations" like RDF/XML, triple stores, SPARQL, etc.) is the how. None of them will make any sense unless the reader understands that RDF is about establishing relationships between data for the purposes of sharing information.
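For readers who want to see the model in action, below is a minimal sketch written in Python with the open source rdflib library. The library choice, and the reuse of the example.org URIs from the tables above, are illustrative assumptions, not part of any standard. The sketch builds a handful of the triples shown above and answers one of the questions from the list:

# a minimal sketch: build a few of the triples above with rdflib and query them
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/")   # the hypothetical namespace used in the examples

g = Graph()
g.add((EX.rome,   EX.name,      Literal("Rome")))
g.add((EX.rome,   RDF.type,     EX.city))
g.add((EX.rome,   EX.subPartOf, EX.italy))
g.add((EX.italy,  EX.name,      Literal("Italy")))
g.add((EX.italy,  RDF.type,     EX.country))
g.add((EX.italy,  EX.subPartOf, EX.europe))
g.add((EX.france, EX.name,      Literal("France")))
g.add((EX.france, RDF.type,     EX.country))
g.add((EX.france, EX.subPartOf, EX.europe))

# "What countries are sub-parts of Europe?" -- trace the arcs in code
for thing in g.subjects(EX.subPartOf, EX.europe):
    if (thing, RDF.type, EX.country) in g:
        print(g.value(thing, EX.name))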
Ontologies & vocabularies
"What languages do they speak in Rome?"
If RDF is built on "sentences" called triples, then by analogy ontologies & vocabularies are the "languages" of RDF. This section describes the role of ontologies & vocabularies in linked data.
Linked data is about putting data in the form of RDF on the Web. RDF is a data model for describing things (resources), and it is made up of three parts: 1) subjects, 2) predicates, and 3) objects. The things being described are the subjects of RDF triples. They represent the things you own. The combined use of predicates and objects forms the descriptions of the resources. These descriptions are akin to a language, or in the parlance of RDF, they are ontologies & vocabularies. While it is perfectly fine to create your own language to describe your own things, it behooves you to use one or more ontologies & vocabularies of others. Otherwise your descriptions will exist in a virtual silo with no interaction with outside resources. When outside ontologies & vocabularies are not employed in RDF, then the purpose of the linking in linked data gets defeated.
RDF ontologies & vocabularies are comprised of classes of objects and properties of those objects. The classes of objects posit the existence of things. They might posit the class of all people, places, events, etc. Properties are characteristics of the classes. People might have names and birth dates. Places have geographic coordinates. Events have dates, times, and descriptions.
There are quite a number of existing ontologies & vocabularies. Some of them of interest to readers of this guide are listed in another section, but a few are briefly discussed here. The first is FOAF (Friend Of A Friend). [3] This ontology/vocabulary is used to describe people. It defines a number of classes, including but not limited to: agent, person, and document. Agents have properties such as mboxes (email addresses), various identifiers, topics of interest, etc. A person, which is a subclass of agent, inherits all of the properties of an agent, but also has a few of its own such as family names, various types of home pages, and images. Documents have properties such as topics and primary interests. If the resources you were describing were people (or agents), then you might want to draw on the FOAF ontology/vocabulary. If the entity named Rome had an email address, then its RDF arced graph might look like this:
[ http://example.org/rome ] --- has mbox ---> [ mailto:rome@example.org ]
Another ontology/vocabulary is DCMI Metadata Terms ("Dublin Core"). [4] It posits the existence of things like: agent, rights statement, standard, and physical resource. It includes properties such as creator, title, description, various types of dates, etc. While the "Dublin Core" standard was originally used to describe bibliographic materials, it has matured and been widely adopted by a growing number of ontologists.
In a project called Linking Lives at the Archives Hub, an ontology/vocabulary for archival description was created. [5, 6] This ontology includes a few classes from FOAF (document, agent, etc.) but also has classes such as repository, archival resource, biographical history, and finding aid. Properties include various dates, extent, title, and biographical history.
A final example is VIAF (Virtual International Authority File). [7] This "ontology" is more akin to a "controlled vocabulary", and it is a list of people's names and URIs associated with them. VIAF is intended to be used in conjunction with things like DCMI's creator property. For example, if Romulus, one of the mythical twins and founders of Rome were associated with the entity of Rome, then the resulting arced graph might look like this:
[ http://example.org/rome ] --- has creator ---> [ http://viaf.org/viaf/231063554/ ]
There are other controlled vocabularies of interest to the readers of this book, including additional name authority files, subject headings, language codes, country listings, etc. These and other ontologies & vocabularies are listed later in the guidebook.
RDF and linked data are about making relationships between things. These relationships are denoted in the predicates of RDF triples, and the types of relationships are defined in ontologies & vocabularies. These ontologies & vocabularies are sometimes called schema. In the world of bibliography (think "Dublin Core"), these relationship types include things such as "has title", "has subject", or "has author". In other ontologies, such as Friend of a Friend (FOAF), there are relationship types such as "has home page", "has email address", or "has name". Obviously there are similarities between things like "has author" and "has name", and consequently there are other ontologies (schemas) allowing equivalence, similarity, or hierarchy to be denoted, specifically RDFS, SKOS, and OWL.
In the world of archives, collections and their items are described. Think metadata. Some of this metadata comes from name authority lists and controlled vocabulary terms. Many of the authority lists and controlled vocabulary terms used by archives exist as linked data. Thus, when implementing RDF in archives one is expected to state things such as "This particular item was authored by this particular URI", or "This particular collection has a subject of this particular URI", where the URIs are values pointing to items in name authority lists or controlled vocabularies.
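To make that idea concrete, here is a small sketch (Python with the rdflib library; the item URI and title are hypothetical) that asserts exactly those kinds of statements, pointing a Dublin Core creator property at the VIAF URI introduced above, and then prints the result as triples:

# a sketch: describe a hypothetical archival item with Dublin Core terms and a VIAF URI
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCTERMS

g = Graph()
item = URIRef("http://example.org/collection/item-42")   # hypothetical item URI

g.add((item, DCTERMS.title,   Literal("Letters concerning the founding of Rome")))  # hypothetical title
g.add((item, DCTERMS.creator, URIRef("http://viaf.org/viaf/231063554/")))           # a VIAF URI as the value

print(g.serialize(format="turtle"))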
Probably one of the more difficult intellectual tasks you will have when it comes to making your data and information available as linked data will be the selection of one or more ontologies used to make your RDF. Probably the easiest -- but not the most precise -- way to think about ontologies is as if they were fields in a MARC record or an EAD file. Such an analogy is useful, but not 100% correct. Probably the best way to think of the ontologies is as if they were verbs in a sentence denoting relationships between things — subjects and objects.
But if ontologies are sets of "verbs", then they are akin to human language, and human language is ambiguous. Therein lies the difficulty with ontologies. There is no "right" way to implement them. Instead, there is only best or common practice. There are no hard and fast rules. Everything comes with a bit of interpretation. The application and use of ontologies is very much like the application and use of written language in general. In order for written language to work well two equally important things need to happen. First, the writer needs to be able to write. They need to be able to choose the most appropriate language for their intended audience. Shakespeare is not "right" with his descriptions of love, but instead his descriptions of love (and many other human emotions) resonate with a very large number of people. Second, written language requires the reader to have a particular adeptness as well. Shakespeare cannot be expected to write one thing and communicate to everybody. The reader needs to understand English, or the translation from English into another language needs to be complete and accurate.
The Internet, by design, is a decentralized environment. There are very few rules on how it is expected to be used. To a great extent it relies on sets of behavior that are more common practice as opposed to articulated rules. For example, what "rules" exist for tweets on Twitter? What rules exist for Facebook or blog postings? Creating sets of rules will not fly on the Internet because there is no over-arching governing body to enforce any rules. Sure, there are things like Dublin Core with their definitions, but those definitions are left to interpretation, and there are no judges nor courts nor laws determining whether or not any particular application of Dublin Core is "correct". Only the common use of Dublin Core is correct, and its use is not set in stone. There are no "should's" on the Internet. There is only common practice.
With this in mind, it is best for you to work with others both inside and outside your discipline to select one or more ontologies to be used in your linked data. Do not think about this too long nor too hard. It is a never-ending process that is never correct. It is only a process that approximates the best solution.
"The people of Rome speak Italian, mostly. But it is not
difficult to hear other languages as well. Rome is an
international city."
RDF serializations
"While Romans speak Italian, mostly. There are different
dialects, each with its own distinct characteristics."
RDF is a data model, and so far it has only been described (and illustrated) in the abstract. RDF needs to be exchanged between computers, and therefore it needs to be more concretely expressed. There are a number of ways RDF can be expressed, and these expressions are called "serializations".
RDF (Resource Description Framework) is a conceptual data model made up of "sentences" called triples — subjects, predicates, and objects. Subjects are expected to be URIs. Objects are expected to be URIs or string literals (think words, phrases, or numbers). Predicates are "verbs" establishing relationships between the subjects and the objects. Each triple is intended to denote a specific fact.
When the idea of the Semantic Web was first articulated XML was the predominant data structure of the time. It was seen as a way to encapsulate data that was both readable by humans as well as computers. Like any data structure, XML has both its advantages as well as disadvantages. On one hand it is easy to determine whether or not XML files are well-formed, meaning they are syntactically correct. Given a DTD, or better yet, an XML schema, it is also easy to determine whether or not an XML file is valid -- meaning it contains the necessary XML elements and attributes, arranged and used in the agreed-upon manner. XML also lends itself to transformations into other plain text documents through the generic, platform-independent XSLT (Extensible Stylesheet Language Transformation) process. Consequently, RDF was originally manifested -- made real and "serialized" -- through the use of RDF/XML.
The example of RDF at the beginning of the Guidebook was an RDF/XML serialization:
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:foaf="http://xmlns.com/foaf/0.1/">
<rdf:Description rdf:about="http://en.wikipedia.org/wiki/Declaration_of_Independence">
<dcterms:creator>
<foaf:Person rdf:about="http://id.loc.gov/authorities/names/n79089957">
<foaf:gender>male</foaf:gender>
</foaf:Person>
</dcterms:creator>
</rdf:Description>
</rdf:RDF>
On the other hand, XML, almost by definition, is verbose. Element names are expected to be human-readable and meaningful, not obtuse nor opaque. The judicious use of special characters (&, <, >, ", and ') as well as entities only adds to the difficulty of actually reading XML. Consequently, almost from the very beginning people thought RDF/XML was not the best way to express RDF, and since then a number of other syntaxes — serializations — have manifested themselves.
Below is the same RDF serialized in a format called Notation 3 (N3), which is very human readable, though not as rigidly structured as XML. It incorporates a line-based data structure called N-Triples to denote the triples themselves:
@prefix foaf: <http://xmlns.com/foaf/0.1/>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix dcterms: <http://purl.org/dc/terms/>.
<http://en.wikipedia.org/wiki/Declaration_of_Independence> dcterms:creator <http://id.loc.gov/authorities/names/n79089957>.
<http://id.loc.gov/authorities/names/n79089957> a foaf:Person;
foaf:gender "male".
JSON (JavaScript Object Notation) is a popular data structure inherent to the use of JavaScript and Web browsers, and RDF can be expressed in a JSON format as well:
{
"http://en.wikipedia.org/wiki/Declaration_of_Independence": {
"http://purl.org/dc/terms/creator": [
{
"type": "uri",
"value": "http://id.loc.gov/authorities/names/n79089957"
}
]
},
"http://id.loc.gov/authorities/names/n79089957": {
"http://xmlns.com/foaf/0.1/gender": [
{
"type": "literal",
"value": "male"
}
],
"http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [
{
"type": "uri",
"value": "http://xmlns.com/foaf/0.1/Person"
}
]
}
}
Just about the newest RDF serialization is an embellishment of JSON called JSON-LD. Compare & contrast the serialization below to the one above:
{
"@graph": [
{
"@id": "http://en.wikipedia.org/wiki/Declaration_of_Independence",
"http://purl.org/dc/terms/creator": {
"@id": "http://id.loc.gov/authorities/names/n79089957"
}
},
{
"@id": "http://id.loc.gov/authorities/names/n79089957",
"@type": "http://xmlns.com/foaf/0.1/Person",
"http://xmlns.com/foaf/0.1/gender": "male"
}
]
}
RDFa represents a way of expressing RDF embedded in HTML, and here is such an expression:
<div xmlns="http://www.w3.org/1999/xhtml"
prefix="
foaf: http://xmlns.com/foaf/0.1/
rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
dcterms: http://purl.org/dc/terms/
rdfs: http://www.w3.org/2000/01/rdf-schema#"
>
<div typeof="rdfs:Resource" about="http://en.wikipedia.org/wiki/Declaration_of_Independence">
<div rel="dcterms:creator">
<div typeof="foaf:Person" about="http://id.loc.gov/authorities/names/n79089957">
<div property="foaf:gender" content="male"></div>
</div>
</div>
</div>
</div>
The purpose of publishing linked data is to make RDF triples easily accessible. This does not necessarily mean the transformation of EAD or MARC into RDF/XML, but rather making accessible the statements of RDF within the context of the reader. In this case, the reader may be a human or some sort of computer program. Each serialization has its own strengths and weaknesses. Ideally an archive will figure out ways to exploit each of the RDF serializations for specific publishing purposes.
For a good time, play with the RDF Translator which will convert one RDF serialization into another. [8]
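The same sort of conversion can also be approximated locally with a few lines of Python and the rdflib library -- an assumption about tooling, not a requirement; the RDF Translator itself is a Web service. The sketch below parses the RDF/XML example shown earlier and writes it back out in two other serializations:

# a sketch: convert the RDF/XML example above into other serializations with rdflib
from rdflib import Graph

rdf_xml = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <rdf:Description rdf:about="http://en.wikipedia.org/wiki/Declaration_of_Independence">
    <dcterms:creator>
      <foaf:Person rdf:about="http://id.loc.gov/authorities/names/n79089957">
        <foaf:gender>male</foaf:gender>
      </foaf:Person>
    </dcterms:creator>
  </rdf:Description>
</rdf:RDF>"""

g = Graph()
g.parse(data=rdf_xml, format="xml")     # read the RDF/XML serialization
print(g.serialize(format="turtle"))     # write it out again as Turtle
print(g.serialize(format="json-ld"))    # or as JSON-LD (built into recent rdflib releases)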
The RDF serialization process also highlights how data structures are moving away from document-centric models to statement-centric models. This too has consequences for the way cultural heritage institutions, like archives, think about exposing their metadata, but that is the topic of another essay.
Publishing linked data
"Rome is a place of many neighborhoods, each with its own
flavor, but in each of the neighborhoods you will experience an
essence of what it means to be Roman."
Just as there are many ontologies & vocabularies, just as there are many ways to express RDF, there are many ways to publish RDF. This section introduces many of them, and they are discussed in greater detail later in the book.
Once your data has been modeled as RDF, it will be time to make it accessible on the Web. There are many ways to do this, and each of them relies on how your data/information is locally stored. Your data/information is probably already modeled in some sort of store, and it will be the job of somebody to transform that data into RDF. For example, the store might be a simple spreadsheet -- a flat file of rows & columns of information. In this model, each row in the spreadsheet will need to have a URI "minted" for it. That URI will then have predicates denoting the names of each column. Finally, the object of each RDF triple will be the value of each item in the row. The result will be a set of RDF in the form of some serialization. This can be put on the Web directly, or saved in a triple store, described below.
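The sketch below illustrates the spreadsheet scenario just described. It is written in Python with the rdflib library and assumes a hypothetical CSV file (collection.csv) with "id", "title", and "creator" columns; the base URI, file name, and column names are illustrative, not prescribed:

# a sketch: mint a URI for each spreadsheet row and turn the columns into triples
import csv
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCTERMS

BASE = "http://archives.example.org/id/"   # hypothetical base for the minted URIs

g = Graph()
with open("collection.csv", newline="") as handle:    # hypothetical spreadsheet export
    for row in csv.DictReader(handle):
        subject = URIRef(BASE + row["id"])            # one minted URI per row
        g.add((subject, DCTERMS.title,   Literal(row["title"])))
        g.add((subject, DCTERMS.creator, Literal(row["creator"])))

# a file of triples ready to be put on the Web or loaded into a triple store
g.serialize(destination="collection.ttl", format="turtle")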
If the data/information is modeled in a relational database, then the process of manifesting it as RDF is very similar to the process of transforming a flat file of rows & columns. There are two other options as well. One is the use of R2RML, a relational database to RDF modeling language. [9] Another option is to use a piece of open source software called D2RQ which will convert a well-designed relational database into RDF. [10] In either case, the resulting RDF could be saved as one or more files on a Web server or shared dynamically directly from the database.
Many people use some sort of document to model their data/information. In the world of archives, this document may be an EAD file or a set of MARC records. While not ideal, it is possible to use XSL to transform the EAD or MARC into an RDF serialization. This is the process employed by the Guidebook's "sandbox" application. This is the same process used by a number of more formal projects including Linking Lives and ReLoad. [5, 11]
If RDF serializations are saved as files on the Web, as in the examples outlined above, then those files may be simple data dumps -- large files consisting of huge numbers of triples. The URLs of those files are then listed in some sort of Web page intended for people to read. Sometimes the locations of these RDF files are listed in a different type of file called a VOID file. [12]
Alternatively, RDF might be published side-by-side with its human-readable HTML counterparts. The VIAF URIs are good examples. If this method is employed, then a process called "content negotiation" is expected to be implemented. Content negotiation is an integral part of the way the World Wide Web operates. It is a method for one computer to request a specific form of data/information from a server. In terms of linked data, a computer can request the content of a URI in the form of HTML or RDF, and the remote server will provide the information in the desired format, if available. Content negotiation is described in greater detail later in the Guidebook.
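From the consumer's side, content negotiation can be as simple as sending an Accept header. The sketch below (Python with the requests library, an assumption about tooling) asks a VIAF URI for RDF rather than HTML; the exact formats any given server offers will vary:

# a sketch: ask the same URI for RDF instead of HTML via content negotiation
import requests

uri = "http://viaf.org/viaf/231063554/"   # the VIAF URI used earlier in this guide

html = requests.get(uri, headers={"Accept": "text/html"})
rdf  = requests.get(uri, headers={"Accept": "application/rdf+xml"})

print(html.headers.get("Content-Type"))   # the human-readable representation
print(rdf.headers.get("Content-Type"))    # the machine-readable representation, if offered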
Some people's data/information may be modeled in a content management system, like Drupal. These systems generate HTML on-the-fly. In these cases it is a good idea to embed RDF into the HTML using RDFa. [13] RDFa is also an option when generating HTML out of databases on-the-fly.
Finally, there exist "databases" specifically designed to store RDF triples. These "databases" are called "triple stores". Along side these triple stores are methods for searching their contents. This searching mechanism is called SPARQL, and SPARQL "endpoints" may be available on the Web. A number of triple stores are lists later in the Guidebook, and a SPARQL tutorial is available as well.
In summary, there are many ways of publishing linked data. At first glance, it may seem as if there are too many choices, each difficult to implement, but in reality modeling your data as RDF is much more challenging.
"Rome has many places to eat. Food is available in cafes, open
air markets, family owned grocery stores, and commercial
conglomerates. There are many ways to get food to the consumer."
Linked open data
"Rome has long understood the benefits of common areas, hence freely accessible squares, fountains, and market places."
Some people make a distinction between linked data and linked open data (LOD). Linked data is what has been described so far in this guidebook. It is a process for making data and metadata available on the Web using RDF as the underlying data model and incorporating into it as many links (URIs) as possible. This process does not necessarily stipulate any intellectual property claims on the published data and metadata, but linked data works best if it is made available with all copyright waived, a waiver often expressed with the Creative Commons Zero (CC0) license. (Intellectual property claims can be explicitly stated through the use of VOID files -- sets of triples using the VOID ontology and published alongside RDF files.) For the most part data and metadata accessible via linked data is assumed to be free, as in gratis and without financial obligation. At the same time, consumers of linked data are expected to acknowledge the time and effort others have spent in the creation of the consumed data, and consumers are not expected to call the data their own. This is an unspoken courtesy.
LOD is linked data that is explicitly denoted as free, as in gratis and without financial obligation. The idea seems to have been born from the idea of all things open (open source software and open access publishing). There is a strong community of LOD advocates in libraries, archives, and museums. The community is called LODLAM and has sponsored a few summits. [14] The Open Knowledge Foundation is also a very strong advocate for LOD. [15]
Strictly speaking linked data is a process, and linked open data is a thing. For all intents and purposes, this guidebook assumes the data and information made accessible via linked data is linked open data, but there might come a time in the future when access controls are placed against some linked data. The data and information of linked data is not necessarily provided gratis.
Consuming linked data
"Rome, like any other city, is full of give and take."
Publishing linked data is only half of the equation. Consuming linked data is the other half. Without considering the consuming part of linked data, it is not possible to reap all of the benefits linked data has to offer.
No primer on linked data would be complete without the inclusion of the famous LOD cloud diagram, below:
[INSERT LOD CLOUD DIAGRAM HERE.]
While the diagram has not been updated in a few years, it represents the sets of published linked data available. It also illustrates how they relate to each other. As you can see, DBpedia is at the center of the cloud, which illustrates how other data sets rely on it for (some) data and information. Many, if not all, of these sets of RDF have been registered in a directory called Datahub. [16] There one can search and browse for all sorts of data sets accessible via linked data standards as well as other protocols. Datahub is a good place to look for data sets complementary to your own.
It is not possible for anybody to completely describe any collection of data. All data and information is inextricably linked to other data and information. There is not enough time nor energy in the world for any individual nor discrete group to do all the work. By harnessing the collective power of the Web (and linked data), it is possible to create more complete collections and more thorough description. These more complete collections and more complete descriptions can be created in two different ways. The first has already been described -- through the judicious use of shared URIs. The second is by harvesting linked data from other locations, combining it with your own, and producing value-added services against the result. Two excellent examples come to mind:
* LIBRIS (http://libris.kb.se) - This is the joint catalog of the
Swedish academic and research libraries. Search results are presented
in HTML, but the URLs pointing to individual items are really
actionable URIs resolvable via content negotiation, thus supporting the
distribution of bibliographic information as RDF.
* ReLoad (http://labs.regesta.com/progettoReload/en) - This is a
collaboration between the Central State Archive of Italy, the Cultural
Heritage Institute of Emilia Romagna Region, and Regesta.exe. It is
the aggregation of EAD files from a number of archives which have been
transformed into RDF and made available as linked data. Its purpose
and intent are very similar to the purpose and intent of the
combined LOCAH Project and Linking Lives.
A number of other example projects are listed in one of the Guidebook's appendices.
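In the spirit of the second approach described above -- harvesting other people's linked data and combining it with your own -- the sketch below (Python with rdflib; the local file name is hypothetical) fetches the triples behind a VIAF URI and folds them into a local graph:

# a sketch: harvest someone else's linked data and combine it with your own
from rdflib import Graph

local = Graph()
local.parse("collection.ttl", format="turtle")    # hypothetical local description

remote = Graph()
remote.parse("http://viaf.org/viaf/231063554/")   # rdflib content-negotiates for an RDF serialization

for triple in remote:                             # fold the harvested triples into the local graph
    local.add(triple)

print(len(local))                                 # the combined, value-added graph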
About linked data, a review
"About Rome, a review"
Many guidebooks about linked data and RDF start out with an elaboration of Tim Berners-Lee's guidelines, but providing it now, in this part of the guide, may be more useful now that some of the underlying principles have been described.
In "Linked Data -- Design Issues" Berners-Lee outlined four often-quoted expectations for implementing the Semantic Web. [2] Each of these expectations are listed below along with some elaborations:
* "Use URIs as names for things" - URIs (Universal Resource
Identifiers) are unique identifiers, and they are expected to
have the same shape as URLs (Universal Resource Locators). These
identifiers are expected to represent things such as people,
places, institutions, concepts, books, etc. URIs are monikers or
handles for real world or imaginary objects.
* "Use HTTP URIs so that people can look up those names." - The
URIs are expected to look and ideally function on the World Wide
Web through the Hypertext Transfer Protocol (HTTP), meaning the
URI's point to things on Web servers.
* "When someone looks up a URI, provide useful information, using
the standards (RDF, SPARQL)" - When URIs are sent to Web servers
by Web browsers (or "user-agents" in HTTP parlance), the response
from the server should be in a conventional, computer readable
format. This format is usually a "serialization" of RDF (Resource
Description Framework) -- a notation looking much like a
rudimentary sentence composed of a subject, predicate, and
object.
* "Include links to other URIs. So that they can discover more
things." - Simply put, try very hard to use URIs other people
have used. This way the relationships you create can
literally be linked to the relationships other people have
created. These links may represent new knowledge.
In the same text Berners-Lee also outlined a sort of reward system -- a set of stars -- for levels of implementation. This reward system also works very well as a strategy for publishing linked data by cultural heritage institutions such as archives. A person gets:
* one star for making data available on the web (in whatever
format) but with an open license
* two stars for making the data machine-readable and structured
data (e.g. Excel instead of an image scan of a table)
* three stars for making the data available in a
non-proprietary format (e.g. comma-separated values instead of
Excel)
* four stars for using open standards from W3C (RDF and SPARQL)
to identify things, so that people can point at your stuff
* five stars for linking your data to other people's data to
provide context
Implementing linked data represents a different, more modern way of accomplishing some of the same goals of archival science. It is a process of making more people aware of your archival description. It is not the only way to make more people aware, but it represents a way that will be widespread, thorough, and complete.
Linked data, more recently referred to as "linked open data", is a proposed technique for generating new knowledge. It is intended to be a synergy between people and sets of agreed upon computer systems that, when combined, will enable both people and computers to discover and build relationships between seemingly disparate data and information in order to create and discover new knowledge.
In a nutshell, this is how it works. People possess data and information. They encode that data and information in any number of formats easily readable by computers. They then make the encoded data and information available on the Web. Computers are then employed to systematically harvest the encoded data. Since the data is easily readable, the computers store the data locally and look for similarly encoded things in other locally stored data sets. When similar items are identified, relationships can be inferred between the items as well as the other items in the data set. To people, some of these relationships may seem obvious and "old hat". On the other hand, since the data sets can be massive, relationships that were never observed previously may come to light, and thus new knowledge is created.
Some of this knowledge may be trivial. For example, there might be a data set of places -- places from all over the world including things like geographic coordinates, histories of the places, images, etc. There might be another data set of people. Each person may be described using their name, their place of birth, and a short biography. These data sets may contain tens of thousands of items each. Using linked data it would be possible to cross reference the people with the places to discover who might have met whom, when, and where. Some people may have similar ideas, and those ideas may have been generated in a particular place. Linked data may help in discovering who was in the same place at the same time, and the researcher may be better able to figure out how a particular idea came to fruition.
The amount of data and information accessible today is greater in size than it has ever been in human history. Using our traditional techniques of reading, re-reading, writing, discussing, etc., it is more than possible to learn new things about the state of the world, the universe, and the human condition. By exploiting the current state of computer technology it is possible to expand upon our traditional techniques and possibly accelerate the growth of knowledge.
When you hear of linked data and the Semantic Web, the next thing you often hear is "RDF" or "Resource Description Framework". First and foremost, RDF is a way of representing information -- a data model. It does this through the use of assertions (think, "sentences") with only three parts: 1) a subject, 2) a predicate, and 3) an object. Put together, these three things create things called "triples".
RDF is not to be confused with RDF/XML or any other type of RDF "serialization". Remember, RDF describes triples, but it does not specify how the triples are expressed or written down. On the other hand, RDF/XML is an XML syntax for expressing RDF. Some people think RDF/XML is too complicated and too verbose. Consequently, other serializations have manifested themselves including N3 and Turtle.
Linked data is a process. It is a process of making information easily accessible and inextricably tied to the way the Web works. By making information available as linked data, you will more easily make your information available to others and vice versa. The result will be value-added services and increased visibility.
Benefits
Archives are about collecting, organizing, preserving, and disseminating original, unique, and primary literature. These are the whats of archival practice, but the hows of archival practice evolve with the changing technology. With the advent of ubiquitous networked computing, people's expectations regarding access to information and knowledge have changed significantly. Unless institutions like archives change with the times, the needs previously filled by archives will be filled by other institutions. Linked data is a how of archival practice, and it is one of those changes that it behooves archives to adopt. It is a standards-based technique for making data and information available on the Web. It is rooted in the very fabric of the Web and therefore is not beholden to any particular constituency. It is a long-lasting standard and practice that will endure as long as the hypertext transfer protocol is operational.
Making archival descriptions and collections available via linked data will increase the use of those descriptions and collections. It is a form of benign advertising. Commercial search engines will harvest the linked data and make it available through their services. Search engines will return hits to your descriptions and collections, driving traffic to you and your site. Digital humanists will harvest your linked data, perform analysis against it, and create new knowledge or bring hidden knowledge to light. Computer scientists will collect your data, amalgamate it with the data of others, and discover relationships previously unconceived.
You can divide your combined collections and services into two tangible parts: 1) the collections themselves, and 2) the metadata describing them. It is usually possible to digitize your collections, but the result is rarely 100% satisfactory. Digitization almost always yields a useful surrogate, not a complete replacement. In this way, your collections as physical objects will always be a draw to all types of learners and researchers. The metadata, on the other hand, is 100% digitizable, and therefore lends itself very well to dissemination on the Internet. Linked data represents one way to make this happen.
Few archival collections are 100% complete. There are always pieces missing, and some of those missing pieces will be owned by others. Your collections will have relationships with other collections, but you will not have direct access to those other collections. Some of these relationships are explicit. Some of them are implicit. If everybody were to expose their metadata, then those explicit and implicit relationships would become more apparent. Once these relationships are strengthened and become more obvious, interest in the collections will increase accordingly, and the collections will be used to a greater degree. With this increased use will come increased attention, and in turn, a greater measure of success for the collections and the services surrounding them.
"Rome is a large city that keeps getting larger. It is built on
rich traditions, and the city continues to evolve. The language
of the Romans is expressive, and people speak with more than just
words. They also speak with their hands. There are many ways to
enjoy Rome. Different areas will appeal to different people, but
to really understand Rome as a whole, a person needs to visit and
appreciate each of them in turn."
Links
[1] canonical article - http://csis.pace.edu/~marchese/CS835/Lec9/112_SemWeb.pdf
[2] design issues - http://www.w3.org/DesignIssues/LinkedData.html
[3] FOAF - http://www.foaf-project.org
[4] DCMI Metadata Terms - http://dublincore.org/documents/dcmi-terms/
[5] Linking Lives project - http://archiveshub.ac.uk/linkinglives/
[6] Linking Lives ontology - http://data.archiveshub.ac.uk/def/
[7] VIAF - http://viaf.org
[8] RDF Translator - http://rdf-translator.appspot.com
[9] R2RML - http://www.w3.org/TR/r2rml/
[10] D2RQ - http://d2rq.org/
[11] ReLoad - http://labs.regesta.com/progettoReload/en
[12] VOID - http://www.w3.org/TR/void/
[13] RDFa - http://www.w3.org/TR/xhtml-rdfa-primer/
[14] LODLAM - http://lodlam.net
[15] Open Knowledge Foundation - http://okfn.org
[16] Datahub - http://datahub.io
Strategies for putting linked data into practice for the archivist
"If you to go to Rome for a day, then walk to the Colosseum and
Vatican City. Everything you see along the way will be extra. If
you go to Rome for a few days, do everything you would do in a
single day, eat and drink in a few cafes, see a few fountains,
and go to a museum of your choice. For a week, do everything you
would do in a few days, and make one or two day-trips outside
Rome in order to get a flavor of the wider community. If you can
afford two weeks, then do everything you would do in a week, and
in addition befriend somebody in the hopes of establishing a
life-long relationship."
When you read a guidebook about Rome -- or any travel guidebook -- there are simply too many listed things to see & do. Nobody can see all the sites, visit all the museums, walk all the tours, nor eat at all the restaurants. It is literally impossible to experience everything a place like Rome has to offer. So it is with linked data. Despite this fact, if you were to do everything linked data has to offer, then you would do all of the things on the following list, starting at the first item, going all the way down to evaluation, and repeating the process over and over:
1. design the structure of your URIs
2. select/design your ontology & vocabularies -- model your data
3. map and/or migrate your existing data to RDF
4. publish your RDF as linked data
5. create a linked data application
6. harvest other people's data and create another application
7. evaluate
8. repeat
Given that it is quite possible you do not plan to immediately dive head-first into linked data, you might begin by getting your feet wet or dabbling in a bit of experimentation. That being the case, here are a number of different "itineraries" for linked data implementation. Think of them as strategies. They are ordered from least costly and most modest to most costly and most complete:
1. Rome in a day - Maybe you can't afford to do anything right now, but if you have gotten this far in the guidebook, then you know something about linked data. Discuss (evaluate) linked data with your colleagues, and consider revisiting the topic in a year.
2. Rome in three days - If you want something relatively quick and easy, but with the understanding that your implementation will not be complete, begin migrating your existing data to RDF. Use XSLT to transform your MARC or EAD files into RDF serializations, and publish them on the Web. Use something like OAI2RDF to make your OAI repositories (if you have them) available as linked data. Use something like D2RQ to make your archival descriptions stored in databases accessible as linked data. Create a triple store and implement a SPARQL endpoint. As before, discuss linked data with your colleagues.
3. Rome in a week - Begin publishing RDF, but at the same time think hard about and document the structure of your future RDF's URIs as well as the ontologies & vocabularies you are going to use. Discuss it with your colleagues. Migrate and re-publish your existing data as RDF using the documentation as a guide. Re-implement your SPARQL endpoint. Discuss linked data not only with your colleagues but with people outside archival practice.
4. Rome in two weeks - First, do everything you would do in one week. Second, supplement your triple store with the RDF of others. Third, write applications against the triple store that go beyond search. These services will be of two types: services for curating the collection information, and services for using the collection. The former services are primarily for archivists and subject specialists. The latter services are primarily intended for everybody from the general public to the academic scholar. These services will be akin to the telling of stories, and you will be discussing linked data with the world, literally.
Rome in a day
"If you to go to Rome for a day, then walk to the Colosseum and
Vatican City. Everything you see along the way will be extra."
Linked data is not a fad. It is not a trend. It makes a lot of computing sense, and it is a modern way of fulfilling some of the goals of archival practice. Just like Rome, it is not going away. An understanding of what linked data has to offer is akin to experiencing Rome first hand. Both will ultimately broaden your perspective. Consequently it is a good idea to make a concerted effort to learn about linked data, as well as visit Rome at least once. Once you have returned from your trip, discuss what you learned with your friends, neighbors, and colleagues. The result will be enlightening for everybody.
The previous sections of this book described what linked data is and why it is important. The balance of the book describes more of the hows of linked data. For example, there is a glossary to help reinforce your knowledge of the jargon. You can learn about HTTP "content negotiation" to understand how actionable URIs can return HTML or RDF depending on the way you instruct remote HTTP servers. RDF stands for "Resource Description Framework", and the "resources" are represented by URIs. A later section of the book describes ways to design the URIs of your resources. Learn how you can transform existing metadata records like MARC or EAD into RDF/XML, and then learn how to put the RDF/XML on the Web. Learn how to exploit your existing databases (such as the ones under Archon, Archivist's Toolkit, or ArchivesSpace) to generate RDF. If you are the do-it-yourself type, then play with and explore the guidebook's tools section. Get the gentlest of introductions to searching RDF using a query language called SPARQL. Learn how to read and evaluate ontologies & vocabularies. They are manifested as XML files, and they are easily readable and visualizable using a number of programs. Read about and explore applications using RDF as the underlying data model. There are a growing number of them. The book includes a complete publishing system written in Perl, and if you approach the code of the publishing system as if it were a theatrical play, then the "scripts" read like scenes. (Think of the scripts as if they were a type of poetry, and they will come to life. Most of the "scenes" are less than a page long. The poetry even includes a number of refrains. Think of the publishing system as if it were a one-act play.) If you want to read more, and you desire a vetted list of books and articles, then a later section lists a set of further reading.
After you have spent some time learning a bit more about linked data, discuss what you have learned with your colleagues. There are many different aspects of linked data publishing, such as but not limited to:
* allocating time and money
* analyzing your own RDF as well as the RDF of others
* articulating policies
* cleaning and improving RDF
* collecting and harvesting the RDF of others
* deciding what ontologies & vocabularies to use
* designing local URIs
* enhancing RDF triple stores by asserting additional relationships
* finding and identifying URIs for the purposes of linking
* making RDF available on the Web (SPARQL, RDFa, data dumps, etc.)
* project management
* provisioning value-added services against RDF (catalogs, finding aids, etc.)
* storing RDF in triple stores
In archival practice, each of these things would be done by different sets of people: archivists & content specialists, administrators & managers, computer programmers & systems administrators, metadata experts & catalogers. Each of these sets of people has a piece of the publishing puzzle and something significant to contribute to the work. Each of these sets of people plays a key and indispensable role in linked data publishing:
* archivists & content specialists - These are the people who understand the "aboutness" of a particular collection. These are the people who understand and can thoroughly articulate the significance of a collection. They know how and why particular things belong in a collection. They are able to answer questions about the collection as all as tell stories against it.
* administrators & managers - These are "resource allocators". They are people who manage time and money. They are people who establish priorities on an institutional level. They are expected to be financially aware and politically savvy. These people will have a view of the wider environment, have their finger on the pulse of where the local institution is moving, and be able to juggle seemingly conflicting directives. More than anybody else, they are expected to be able to outline a plan and see it to fruition.
* metadata specialists & catalogers - These are people who understand data about data. Not only do they understand the principles of controlled vocabularies and authority lists, but they are also familiar with a wide variety of such lists, specifically as they are represented on the Web. In linked data there are fewer descriptive cataloging "rules". Nevertheless, the way the ontologies of linked data can be used need to be interpreted, and this interpretation needs to be consistent. Metadata specialists understand these principles.
* computer programers & systems administrators - Not only are these the people who have a fundamental understanding of what computer can and cannot do, but they also know how to put this understanding into practice. At the very least, the computer technologists need to understand a myriad of data structures and how to convert them into different data structures. Converting MARC 21 into MARCXML. Transforming EAD into HTML. Reporting against a relational database to create serialized RDF. These tasks required computer programming skills, but not necessarily any one in particular. Any modern programming language (Java, PHP, Python, Ruby, etc.) includes the necessary function to complete the tasks.
Read about linked data. Learn about linked data. Bring these sets of people together to discuss what you have learned. With this in mind, articulate some goals -- broad targets of things you would like to accomplish. Some of them might include:
* making your archival collections more widely accessible
* working with others to build virtual collections of like topics or formats
* incorporating your archival descriptions into public spaces like Wikipedia
* integrating your collections into local teaching, learning, and research activities
* increasing the awareness of your archive to benefactors
* increasing the computer technology skills of fellow archivists
The what of your objectives is not so much identified with nouns as with action verbs, such as: write, evaluate, implement, examine, purchase, hire, prioritize, list, delete, acquire, discuss, share, find, compare & contrast, stop, start, complete, continue, describe, edit, update, create, upgrade, etc. The what of your objective is in the doing.
After discussing these sorts of issues, at the very least you will have a better collective understanding of the possibilities. If you don't plan to "go to Rome" right away, you might decide to reconsider the "vacation" at another time.
"Even Michelangelo, when he painted the Sisten Chapel, worked
with a team of people each possessing a complementary set of
skills. Each had something different to offer, and the
discussion between themselves was key to their success."
Rome in three days
"If you to go to Rome for a few days, do everything you would do
in a single day, eat and drink in a few cafes, see a few
fountains, and go to a museum of your choice."
Linked data in archival practice is not new. Others have been here previously. You can benefit from their experience and begin publishing linked data right now using tools with which you are probably already familiar. For example, you probably have EAD files, sets of MARC records, or metadata saved in database applications. Using existing tools, you can transform this content into RDF and put the result on the Web, thus publishing your information as linked data.
At its very root, linked data is about making your data available for others to harvest and use. While the "killer linked data application" has seemingly not reared its head, this does not mean you ought not make your data available as linked data. You won't see the benefits immediately, but sooner or later (less than 5 years from now), you will see your content creeping into the search results of Internet indexes, into the work of both computational humanists and scientists, and into the hands of esoteric hackers creating one-off applications. Internet search engines will create "knowledge graphs", and they will include links to your content. The humanists and scientists will operate on your data similarly. Both will create visualizations illustrating trends. They will both quantifiably analyze your content looking for patterns and anomalies. Both will probably create network diagrams demonstrating the flow and interconnection of knowledge and ideas through time and space. The humanist might do all this in order to bring history to life or demonstrate how one writer influenced another. The scientist might study ways to efficiently store your data, easily move it around the Internet, or connect it with data sets created by their apparatus. The hacker (those are the good guys) will create flashy-looking applications that many will think are weird and useless, but the applications will demonstrate how the technology can be exploited. These applications will inspire others, be here one day and gone the next, and over time, become more useful and sophisticated.
EAD
If you have used EAD to describe your collections, then you can easily make your descriptions available as valid linked data, but the result will be less than optimal. This is true not because of a lack of technology but rather because of the inherent purpose and structure of EAD files.
A few years ago an organization in the United Kingdom called the Archives Hub was funded by a granting agency called JISC to explore the publishing of archival descriptions as linked data. The project was called LOCAH. [1] One of the outcomes of this effort was the creation of an XSL stylesheet (ead2rdf) transforming EAD into RDF/XML. [2] The terms used in the stylesheet originate from quite a number of standardized, widely accepted ontologies, and with only the tiniest bit of configuration / customization the stylesheet can transform a generic EAD file into valid RDF/XML for use by anybody. The resulting XML files can then be made available on a Web server or incorporated into a triple store. This goes a long way to publishing archival descriptions as linked data. The only additional things needed are a transformation of EAD into HTML and the configuration of a Web server to do content negotiation between the XML and HTML.
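For the curious, here is a minimal sketch of such a transformation using Python's lxml library. The file names (ead2rdf.xsl, finding-aid.xml) are placeholders for your own copy of the stylesheet and one of your own EAD files, and the stylesheet may accept parameters not shown here.

  from lxml import etree

  stylesheet = etree.XSLT(etree.parse("ead2rdf.xsl"))  # the ead2rdf transformation
  ead        = etree.parse("finding-aid.xml")          # a local EAD file
  rdf        = stylesheet(ead)                         # apply the stylesheet

  # save the result for a Web server or for ingestion into a triple store
  with open("finding-aid.rdf", "wb") as handle:
      handle.write(etree.tostring(rdf, pretty_print=True))
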
For the smaller archive with only a few hundred EAD files whose content does not change very quickly, this is a simple, feasible, and practical solution to publishing archival descriptions as linked data. With the exception of doing some content negotiation, this solution does not require any computer technology that is not already being used in archives, and it only requires a few small tweaks to a given workflow:
1. implement a content negotiation solution (a minimal sketch follows this list)
2. create and maintain EAD files
3. transform EAD into RDF/XML
4. transform EAD into HTML
5. save the resulting XML and HTML files on a Web server
6. go to step #2
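Content negotiation is most often configured in the HTTP server itself, but the idea can be illustrated with a minimal sketch using Python's Flask framework. A single URI returns RDF/XML to linked data agents and HTML to everybody else, depending on the Accept header of the request; the route and file locations are hypothetical.

  from flask import Flask, request, send_file

  app = Flask(__name__)

  @app.route("/id/findingaid/<name>")
  def finding_aid(name):
      # choose a representation based on the Accept header of the request
      best = request.accept_mimetypes.best_match(
          ["application/rdf+xml", "text/html"], default="text/html")
      if best == "application/rdf+xml":
          return send_file("rdf/" + name + ".rdf", mimetype="application/rdf+xml")
      return send_file("html/" + name + ".html", mimetype="text/html")

A user-agent asking for application/rdf+xml gets the RDF/XML transformation, a Web browser gets the HTML transformation, and the URI itself never changes.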
EAD is a combination of narrative description and a hierarchical inventory list, and this data structure does not lend itself very well to the triples of linked data. For example, EAD headers are full of controlled vocabulary terms, but there is no way to link these terms with specific inventory items. This is because the vocabulary terms are expected to describe the collection as a whole, not individual things. This problem could be overcome if each individual component of the EAD were associated with controlled vocabulary terms, but this would significantly increase the amount of work needed to create the EAD files in the first place.
The common practice of using literals to denote the names of people, places, and things in EAD files would also need to be changed in order to fully realize the vision of linked data. Specifically, it would be necessary for archivists to supplement their EAD files with commonly used URIs denoting subject headings and named authorities. These URIs could be inserted into id attributes throughout an EAD file, and the resulting RDF would be more linkable, but the labor to do so would increase, especially since many of the named items will not exist in standardized authority lists.
Despite these shortcomings, transforming EAD files into some sort of serialized RDF goes a long way towards publishing archival descriptions as linked data. This particular process is a good beginning and outputs valid information, just information that is not as linkable as possible. This process lends itself to iterative improvements, and outputting something is better than outputting nothing. But this particular process is not for everybody. The archive whose content changes quickly, the archive with copious numbers of collections, or the archive wishing to publish the most complete linked data possible will probably not want to use EAD files as the root of their publishing system. Instead, some sort of database application is probably the best solution.
MARC
In some ways MARC lends itself very well to being published via linked data, but in the long run it is not really a feasible data structure.
Converting MARC into serialized RDF through XSLT is at least a two-step process. The first step is to convert MARC into MARCXML and then to convert the MARCXML into MODS. This can be done with any number of scripting languages and toolboxes. The second step is to use a stylesheet such as the one created by Stefano Mazzocchi to transform the MODS into RDF/XML -- mods2rdf. [3] From there a person could save the resulting XML files on a Web server, enhance access via content negotiation, and call it linked data.
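Below is a minimal sketch of the first of those conversions -- MARC into MARCXML -- using the Python pymarc library; the file names are placeholders, and the exact return types differ slightly between pymarc versions. The MARCXML-to-MODS and MODS-to-RDF/XML conversions would then be done with XSL stylesheets, just as with the EAD example earlier.

  from pymarc import MARCReader, record_to_xml

  with open("records.mrc", "rb") as marc, open("records.xml", "w") as marcxml:
      marcxml.write('<collection xmlns="http://www.loc.gov/MARC21/slim">')
      for record in MARCReader(marc):
          xml = record_to_xml(record)     # one MARCXML record per MARC record
          if isinstance(xml, bytes):      # some pymarc versions return bytes
              xml = xml.decode("utf-8")
          marcxml.write(xml)
      marcxml.write('</collection>')
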
Unfortunately, this particular approach has a number of drawbacks. First and foremost, the MARC format has no place to denote URIs; MARC records are made up almost entirely of literals. Sure, URIs can be constructed from various control numbers, but things like authors, titles, subject headings, and added entries will most certainly be literals ("Mark Twain", "Adventures of Huckleberry Finn", "Bildungsroman", or "Samuel Clemens"), not URIs. This issue can be overcome if the MARCXML were first converted into MODS and URIs were inserted into id or xlink attributes of bibliographic elements, but this is extra work. If an archive were to take this approach, then it would also behoove them to use MODS as their data structure of choice, not MARC. Continually converting from MARC to MARCXML to MODS would be expensive in terms of time. Moreover, with each new conversion the URIs from previous iterations would need to be re-created.
EAC-CPF
Encoded Archival Context for Corporate Bodies, Persons, and Families (EAC-CPF) goes a long way to implementing a named authority database that could be linked from archival descriptions. [4] These XML files could easily be transformed into serialized RDF and therefore linked data. The resulting URIs could then be incorporated into archival descriptions, making the descriptions richer and more complete. For example, the FindAndConnect site in Australia uses EAC-CPF under the hood to disseminate information about people in its collection. [5] Similarly, "SNAC aims to not only make the [EAC-CPF] records more easily discovered and accessed but also, and at the same time, build an unprecedented resource that provides access to the socio-historical contexts (which includes people, families, and corporate bodies) in which the records were created." [6] More than a thousand EAC-CPF records are available from the RAMP project. [7]
METS, MODS, OAI-PMH service providers, and perhaps more
If you have archival descriptions in either the METS or MODS formats, then transforming them into RDF is as far away as your XSLT processor and a content negotiation implementation. As of this writing there do not seem to be any METS-to-RDF stylesheets, but there are a couple of stylesheets for MODS. The biggest issue with these sorts of implementations is the URIs. It will be necessary for archivists to include URIs in as many MODS id or xlink attributes as possible. The same thing holds true for METS files, except the id attribute is not designed to hold pointers to external sites.
Some archives and libraries use a content management system called ContentDM. [8] Whether they know it or not, ContentDM comes complete with an OAI-PMH (Open Archives Initiative - Protocol for Metadata Harvesting) interface. This means you can send a REST-ful URL to ContentDM, and you will get back an XML stream of metadata describing digital objects. Some of the digital objects in ContentDM (or any other OAI-PMH service provider) may be something worth exposing as linked data, and this can easily be done with a system called oai2lod. [9] It is a particular implementation of D2RQ, described below, and works quite well. Download the application. Feed oai2lod the "home page" of the OAI-PMH service provider, and oai2lod will publish the OAI-PMH metadata as linked open data. This is another quick & dirty way to get started with linked data.
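If you would rather harvest the OAI-PMH metadata yourself before converting it, the following is a minimal sketch using the Python Sickle library; the endpoint URL is a placeholder for your ContentDM (or other) OAI-PMH interface.

  from sickle import Sickle

  provider = Sickle("http://example.org/oai/oai.php")  # your OAI-PMH endpoint
  for record in provider.ListRecords(metadataPrefix="oai_dc"):
      print(record.header.identifier)  # the OAI identifier of the record
      print(record.raw)                # the raw Dublin Core XML, ready for transformation
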
Databases
Publishing linked data through XML transformation is functional but not optimal. Publishing linked data from a database comes closer to the ideal but requires a greater amount of technical computer infrastructure and expertise.
Databases -- specifically, relational databases -- are the current best practice for organizing data. As you may or may not know, relational databases are made up of many tables of data joined together with keys. For example, a book may be assigned a unique identifier. The book has many characteristics such as a title, number of pages, size, descriptive note, etc. Some of the characteristics are shared by other books, like authors and subjects. In a relational database these shared characteristics would be saved in additional tables, and they would be joined to a specific book through the use of unique identifiers (keys). Given this sort of data structure, reports can be created from the database describing its content. Similarly, queries can be applied against the database to uncover relationships that may not be apparent at first glance or that are buried in reports. The power of relational databases lies in the use of keys to make relationships between rows in one table and rows in other tables. The downside of relational databases as a data model is the infinite variety of field/table combinations, which makes them difficult to share across the Web.
Not coincidentally, relational database technology is very much the way linked data is expected to be implemented. In the linked data world, the subjects of triples are URIs (think database keys). Each URI is associated with one or more predicates (think the characteristics in the book example). Each triple then has an object, and these objects take the form of literals or other URIs. In the book example, the object could be "Adventures Of Huckleberry Finn" or a URI pointing to Mark Twain. The reports of relational databases are analogous to RDF serializations, and SQL (the relational database query language) is analogous to SPARQL, the query language of RDF triple stores. Because of the close similarity between well-designed relational databases and linked data principles, the publishing of linked data directly from relational databases makes a whole lot of sense, but the process requires the combined time and skills of a number of different people: content specialists, database designers, and computer programmers. Consequently, the process of publishing linked data from relational databases may be optimal, but it is more expensive.
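The analogy can be made concrete with a minimal sketch: a hypothetical books table in a SQLite database is read row by row, the primary key of each row becomes the subject URI, and the columns become predicates and objects. Python's built-in sqlite3 module and the rdflib library are assumed; the table, columns, and URIs are illustrative.

  import sqlite3
  from rdflib import Graph, URIRef, Literal, Namespace

  DC   = Namespace("http://purl.org/dc/elements/1.1/")
  BASE = "http://example.org/id/book/"

  graph = Graph()
  with sqlite3.connect("books.db") as db:
      for key, title, author in db.execute("SELECT id, title, author FROM books"):
          book = URIRef(BASE + str(key))                 # database key becomes the URI
          graph.add((book, DC.title,   Literal(title)))  # column becomes predicate & object
          graph.add((book, DC.creator, Literal(author)))

  print(graph.serialize(format="turtle"))                # the "report" is serialized RDF
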
Thankfully, many archivists probably use some sort of behind-the-scenes database to manage their collections and create their finding aids. Moreover, archivists probably use one of three or four tools for this purpose: Archivist's Toolkit, Archon, ArchivesSpace, or PastPerfect. Each of these systems has a relational database at its heart. Reports could be written against the underlying databases to generate serialized RDF and thus begin the process of publishing linked data. Doing this from scratch would be difficult, as well as inefficient, because many people would be starting out with the same database structure but creating a multitude of varying outputs. Consequently, there are two alternatives. The first is to use a generic database-to-RDF publishing platform called D2RQ. The second is for the community to join together and create a holistic RDF publishing system based on the database(s) used in archives.
D2RQ is a very powerful software system. [10] It is supported, well-documented, executable on just about any computing platform, open source, focused, functional, and at the same time does not try to be all things to all people. Using D2RQ it is more than possible to quickly and easily publish a well-designed relational database as RDF. The process is relatively simple:
* download the software
* use a command-line utility to map the database
structure to a configuration file
* edit the configuration file to meet your needs
* run the D2RQ server using the configuration file
as input thus allowing people or RDF user-agents
to search and browse the database using linked
data principles
* alternatively, dump the contents of the database
to an RDF serialization and ingest the result
into your favorite RDF triple store
The downside of D2RQ is its generic nature. It will create an RDF ontology whose terms correspond to the names of database fields. These field names do not map to widely accepted ontologies & vocabularies and therefore will not interact well with communities outside the ones using a specific database structure. Still, the use of D2RQ is quick, easy, and accurate.
Triple stores and SPARQL endpoints
Publishing your RDF as static files is relatively easy, but in order to really take advantage of your efforts you will want to additionally save your RDF in a triple store. There are many open source and commercial applications from which to choose. They all come with a set of specialized features, but they all work essentially the same way. Install. Configure. Run the application to ingest RDF. Use different functions to edit and report on the contents of the triple store. Comparing & contrasting each of the available triple stores is beyond the scope of this document, but at the very least, look for one that provides access via a SPARQL endpoint. SPARQL is the query language of RDF triple stores, but the query language goes beyond the generation of lists. It can also be used to assert new RDF statements and have them saved to the store. It provides the means for answering Boolean questions or performing mathematical functions. Consequently, SPARQL can be used not only to search for information but also to create new information and answer real-world questions.
Simply publishing RDF as static files solves only half of the problem. By saving your RDF in a triple store and providing access to it via a SPARQL endpoint you open yourself up to possibilities that go beyond the dissemination of finding aids and descriptions. Creating triple stores of your RDF (as well as the RDF of others) enables you to tell more compelling stories and succinctly answer given questions.
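As an illustration, here is a minimal sketch of querying a SPARQL endpoint from a program, using the Python SPARQLWrapper library. The endpoint URL is a placeholder, and the query assumes the store contains Dublin Core titles; adjust both to your own data.

  from SPARQLWrapper import SPARQLWrapper, JSON

  endpoint = SPARQLWrapper("http://example.org/sparql")
  endpoint.setQuery("""
      PREFIX dc: <http://purl.org/dc/elements/1.1/>
      SELECT ?resource ?title
      WHERE { ?resource dc:title ?title }
      LIMIT 10
  """)
  endpoint.setReturnFormat(JSON)

  # print ten resources and their titles
  for row in endpoint.query().convert()["results"]["bindings"]:
      print(row["resource"]["value"], "-", row["title"]["value"])
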
Talk with your colleagues
Do not keep your candle hidden beneath a bushel basket. Tell your colleagues about your accomplishments. Write about your experience and share it via mailing lists, blog postings, conference presentations, formally published articles, and books. Through the writing process you will become more aware of both your successes as well as your failures. Others will see what you have done and offer comments. Some will follow your lead. Others will take your lead and go further. In any case, the dialog about linked data -- in terms of both its strengths & weaknesses -- will grow louder. Such dialog can only be a good thing. More useful techniques will become best practices, and less useful techniques will fade from existence. Dialog builds community, and communities can become very strong. A particular proverb comes to mind. "If you want to go fast, then go alone. If you want to go far, take your friends." Work in the sphere of linked data is exactly like this. Manifesting your archival description by transforming EAD into RDF is all well and good. It is fast & easy, but if you want to truly take advantage of the "linking" in linked data, then you will need to talk with your colleagues -- both in and out of archives -- in order to take things to another level.
"If you are going to be in Rome for only a few days, you will
want to see the major sites, and you will want to adventure out &
about a bit, but at the same time it will be a wise idea to
follow the lead of somebody who has been there previously. Take
the advice of these people. It is an efficient way to see some of
the sights."
Rome in a week
"For a week, do everything you would do in a few days, and make
one or two day-trips outside Rome in order to get a flavor of the
wider community."
In this "itinerary" you begin publishing RDF, but at the same time you actively design and document URIs as well as select ontologies & vocabularies for your RDF. You then re-publish your existing data using your documentation as a guide. And you re-implement your SPARQL endpoint. Discuss linked data not only with your colleagues but with people outside archival practice. This particular strategy is probably the most difficult it requires a lot of re-thinking about archival practice and ways it can be manifested. To these ends, there are three books of particular interest to the reader. Think of them as required reading:
1. Linked data: Evolving the Web into a global data space by Tom Heath and Christian Bizer [11] - This book provides a thorough and up-to-date overview of what linked data is and why it is important. It is full of useful examples and recipes for implementation.
2. Semantic Web for the working ontologist: Effective modeling in RDFS and OWL by Dean Allemang and James Hendler [12] - This book is primarily intended for people who are evaluating and designing ontologies & vocabularies. It compares & contrasts RDF to alternative data models (like spreadsheets or relational databases), and explains why RDF is particularly suitable to the Web. It then outlines how to create one's own ontology.
3. Linked data patterns: A pattern catalogue for modeling, publishing, and consuming linked data by Leigh Dodds and Ian Davis [13] - As a "pattern catalog", this book is a long list of common problems experienced by people implementing linked data, along with accompanying discussion and solutions to the problems. The problems and solutions are divided into a number of categories: URIs (identifiers), modeling, publishing, data management, and applications.
Design your URIs
The heart of linked data is the description of things. Things can be anything, either real or imaginary, and all of these things are expected to be represented by URIs. Consequently, when you are intending to publish linked data, you need to think about the structure of the identifiers of your things. To a large degree URIs are expected to be "cool". [14] This means they should not change and they should be:
* simple - short, human-readable without any name/value pairs,
and easily depicted in written form
* stable - immutable, something that lasts two, twenty, or two
hundred years, sans implementation bits such as file name
extensions because file types (JPEG, PDF, etc.) change over time
* manageable - meaning you have some sort of system for dealing
with implementation changes behind the scenes; content negotiation
systems and HTTP server management systems are keys to success
Pete Johnston, while designing the URIs for the "things" in LOCAH, articulated the following pattern for the URIs he was going to mint [15]:
http://{domain}/id/{concept}/{reference}
Examples then included:
* http://example.org/id/findingaid/mums059
* http://example.org/id/family/clarkfamily
* http://example.org/id/repository/usUS-DLC
These URIs were derived from the data inside EAD files and took into account:
* local identifiers - URIs based on embedded identifiers
* authority controlled values - building from the names in
remote lists
* locally scoped names - building from the names in local lists
* locally developed "rules" - identifiers based on long strings
of text
* identifier inheritance - URIs starting with local identifiers
and building on hierarchy
These principles are not very much different from some of the Identifier Patterns articulated by Dodds and Davis. Patterns included:
* hierarchical URIs - these are "patterned" URIs denoting sub and
super-class relationships
* literal keys - based on custom properties, such as a sub-class
of Dublin Core identifiers
* natural keys - URIs built from things like ISSN numbers, OCLC
numbers, database keys, etc.
* patterned URIs - designed after predictable schemes, much like
Johnston's examples
* proxy URIs - URIs created after some alignment is done against
amalgamated RDF
* rebased URIs - a renamed URI done so because the previous one
was not as "cool" as it could be
* shared keys - identifiers created across domains
* URL slugs - patterned URIs based on strings
Designing your URIs is a bit of a chicken & egg problem. You are designing URIs for the "things" of your archival description, and before you can create the URIs you need to know what "things" you have, which is the subject of the next section. On the other hand, the subjects and objects of RDF statements are expected to be URIs -- the identifiers you will be sharing -- the very heart of linked data. In other words, you need both URIs as well as ontologies & vocabularies at once and at the same time to do the work of linked data. They go hand-in-hand. You can not do one without the other.
At the very least remember one thing: design your URIs to be "cool".
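For illustration only, here is the tiniest of sketches in Python for minting URIs following the http://{domain}/id/{concept}/{reference} pattern described above; the domain and sample values are placeholders.

  DOMAIN = "example.org"

  def mint_uri(concept, reference):
      "Return a simple, stable, patterned URI for a locally described 'thing'."
      return "http://" + DOMAIN + "/id/" + concept + "/" + reference

  print(mint_uri("findingaid", "mums059"))   # http://example.org/id/findingaid/mums059
  print(mint_uri("repository", "usUS-DLC"))  # http://example.org/id/repository/usUS-DLC
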
Select your ontology & vocabularies
Work with your friends and colleagues to articulate and document an "application profile" -- a set of best practices describing which ontologies & vocabularies you will use and how you will use them. As these guidelines and best practices get articulated, implement them by going back to the design of your URIs (Step #1 of the list of linked data activities above). In the meantime, continue on to publishing your RDF (Step #4).
Republish your RDF
The second alternative introduced in the section on databases above -- a holistic, community-built RDF publishing system -- requires community effort and coordination. The databases of Archivist's Toolkit, Archon, ArchivesSpace, or PastPerfect could be assumed. The community could then get together and decide on an RDF ontology to use for archival descriptions. The database structure(s) could then be mapped to this ontology. Next, programs could be written against the database(s) to create serialized RDF, thus beginning the process of publishing linked data. Once that was complete, the archival community would need to come together again to ensure it uses as many shared URIs as possible, thus creating the most functional sets of linked data. This second alternative requires a significant amount of community involvement and widespread education. It represents a never-ending process.
Publish your RDF
No matter what kind of RDF you are able to create, make it available on the Web.
Discuss linked data with people outside archival practice
Rome in two weeks
"If you can afford two weeks, then do everything you would do in
a week, and in addition befriend somebody in the hopes of
establishing a life-long relationship."
Now that you have significantly practiced with the principles of publishing linked data, it is time to harvest the RDF of others so you can enhance your archival services. To do this you will first continue to do everything in the previous sections. You will then supplement and enhance your triple store with the RDF and information of others. Third, you will write applications against the triple store that go beyond search. These services will be of two types: 1) services for curating the collection of information, and 2) services for using the collection. The former services are primarily for archivists and metadata specialists. The latter are primarily intended for everybody from the general public to the scholar. These services will be akin to the telling of stories, and once implemented you will be discussing linked data with the world, literally.
Supplement and enhance your triple store
RDF is published in many different ways such as but not limited to: as triple store dumps, via content negotiation on the other side of actionable URIs, embedded in HTML files as RDFa, available as the result of SPARQL queries, etc. Other metadata and information is available through other means such as RESTful websites, OAI-PMH data providers, and spreadsheets and databases from communities of interest. The information and metadata from each of these "places" can be identified and prioritized into some sort of collection development policy. For example, you might be interested in augmenting your RDF with the RDF of other archival collections. You might want to supplement your descriptions with images of people, places, and things. The places in your archival description may be enhanced with geographic coordinates. Controlled vocabularies in your description may be equated with or related to other controlled vocabularies, enabling interlinking across collections. Whatever the reason, the harvesting and collecting of other people's metadata and information can only enrich the information services you provide.
Begin the harvesting process by taking advantage of directories of RDF and metadata sets. Datahub is a good example. [16] Dumps of RDF could be mirrored locally and then ingested into the triple store. HTML pages could be crawled, RDFa extracted, and the result ingested into the triple store. Lists of actionable URIs could be created by searching remote websites. These actionable URIs could then be fed to a computer program which will use content negotiation to harvest the remote RDF. The actual harvesting process will almost definitely require the skills of a computer programmer or systems administrator. After remote metadata and RDF have been incorporated into your local triple store, you will want to continue to maintain your local SPARQL endpoint, because you will definitely use SPARQL to implement the triple store curation services as well as the services for the general public and scholar.
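A minimal sketch of the content negotiation style of harvesting, using the Python rdflib library, appears below. The list of actionable URIs is a placeholder for URIs gathered from directories, searches, or your own collection development policy.

  from rdflib import Graph

  uris = [
      "http://example.org/id/person/twain",
      "http://example.org/id/findingaid/mums059",
  ]

  store = Graph()
  for uri in uris:
      # rdflib requests RDF via content negotiation and parses whatever is returned
      store.parse(uri)

  print(len(store), "triples harvested")
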
Curating the collection
As additional information is brought into your triple store, there will be an increasing need to curate the information. This process is very similar to, if not congruent with, the processes behind a project of a few years ago called the National Science Digital Library (NSDL). [17] This project was intended to harvest data from across the Internet, amalgamate it, and provide services against the result. These services often included creating subsets of the data and search services on top of them. This curation process is similar to the collection / services processes undertaken by Europe's Europeana project and the Digital Public Library of America.
Once the metadata and information have been acquired there is a need to make it the "best" information possible. What does this mean? In this case, the word "best" connotes information that is: 1) consistent, 2) correct, 3) accurate, 4) complete, and 5) timely. The processes of maintaining metadata and information are just as ongoing and indeterminate as the curation of a physical collection.
Creating and maintaining metadata is a never-ending process. The items being described can always use elaboration. Collections may increase in size. Rights applied against content may change. Things become digitized, or digitized things are migrated from one format to another. Because of these sorts of things and many others, cleanup, conversion, and consistency are something every metadata specialist needs to keep in mind. They are things to be managed and maintained as the linked data of others is assimilated into your own collection. These issues have been nicely elaborated upon in separate articles by Diane Hillmann and Thomas Johnson. [18, 19] Hillmann outlines a number of problems:
1. missing data - metadata elements not present in supplied
metadata
2. incorrect data - metadata values not conforming to standard
element use
3. confusing data - multiple values crammed into a single
metadata element, embedded HTML tags, etc.
4. insufficient data - no indication of controlled vocabularies
used
Johnson outlines similar things (a small cleanup sketch, in Python, follows the list):
1. removing "noise" - removing named labels with no values (empty
elements), removing statements with non-information whose values
are things like "unknown" or "n/a", or values whose content is
only punctuation
2. normalizing presentation - removing extraneous white spaces,
removing HTML double encodings, normalizing the order of first
names and last names, etc.
3. assigning URIs to curation objects - identifying URIs for
string values, assigning URIs to digitized objects
4. mapping legacy elements to linked data vocabularies - for
example, transforming any elements with values like "245" to some
form of title
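The sketch below, using the Python rdflib library, implements only the first of these tasks -- removing "noise" -- and the notion of what counts as a non-informative value is an assumption to be adjusted locally.

  import string
  from rdflib import Graph, Literal

  NOISE = {"", "unknown", "n/a", "none"}

  def remove_noise(graph):
      "Delete statements whose object is an empty, non-informative, or punctuation-only literal."
      for s, p, o in list(graph):  # list() so the graph can be modified while looping
          if isinstance(o, Literal):
              value = str(o).strip().lower()
              if value in NOISE or all(c in string.punctuation for c in value):
                  graph.remove((s, p, o))
      return graph
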
Linked data affords an additional type of enhancement -- enhancements of relationship and augmentation. RDF statements are built from ontologies, and statements will be asserted and brought together into a single collection. One set of statements may use FOAF. Another may use Dublin Core. These two ontologies both include the concept of names of people. You might want to make assertions in your RDF equating the names in FOAF with the names in Dublin Core. Similar things may be done with identifiers, where URIs of one type are intended to be equivalent to URIs of another type. For example, the URI denoting Mark Twain in VIAF may be equivalent to a URI for Mark Twain in your own RDF collection. By denoting equivalence between these two items, additional information can be brought to bear regarding Mark Twain.
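Such an equivalence is commonly asserted with the owl:sameAs property. Below is a minimal sketch using the Python rdflib library; the local URI is hypothetical, and the VIAF identifier shown is illustrative only, so confirm the actual identifier at viaf.org before asserting anything.

  from rdflib import Graph, URIRef
  from rdflib.namespace import OWL

  graph = Graph()
  viaf  = URIRef("http://viaf.org/viaf/50566653")       # a VIAF URI for Mark Twain (illustrative)
  local = URIRef("http://example.org/id/person/twain")  # a locally minted URI

  graph.add((local, OWL.sameAs, viaf))  # assert that the two URIs denote the same person
  print(graph.serialize(format="turtle"))
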
Cleanup, conversion, and consistency mean many things. Does all of your metadata use the same set of one or more vocabularies? Are things spelled correctly? Maybe you used abbreviations in one document but spelled things out in another? Have you migrated your JPEG images to JPEG2000 or TIFF formats? Maybe the EAD DTD has been updated, and you want (need) to migrate your finding aids from one XML format to another? Do all of your finding aids exhibit the same level of detail; are some "thinner" than others? Have you used one form of a person's name in one document but used another form in a different document? The answers to these sorts of questions point to the need for cleanup, conversion, and consistency.
Is your archival description LOD-ready? Now? The simple, straightforward answer is, "Yes." The longer and more complicated answer is, "No. Your data is never 100% linked data ready because the process of archival description is never finished."
Applications & use-cases
In the previous "tours of Rome", the creation of a simple search engine against your triple store was suggested. In those "itineraries" other applications could have been created, but now that additional metadata and information have been brought to light it will be behoove you to go beyond the simple search engine and begin to tell stories. If not, then the services you are providing aren't very much dissimilar to a search against Google. Here are a few application and use-case ideas.
Create a union catalog. This is really an enhancement of the simple search idea. If you make your data available as linked data, and if you find at least one other archive who is making their data available as linked data, then you can find a third somebody who will combine them into a triple store and implement a rudimentary SPARQL interface against the union. Once this is done a researcher could conceivably search the interface for a URI to see what is in both collections. The absolute imperative key to success for this to work is the judicious inclusion of URIs in both data sets. This scenario becomes even more enticing with the inclusion of two additional things. First, the more collections in the triple store the better. You can never have too many collections in the store. Second, the scenario will be even more enticing when each archive publishes their data using ontologies similar to everybody else's. Success does not hinge on similar ontologies, but success is significantly enhanced. Just like the relational databases of today, nobody will be expected to query them using their native query language (SQL or SPARQL). Instead the interfaces will be much more user-friendly. The properties of classes in ontologies will become facets for searching and browsing. Free text as well as fielded searching via drop-down menus will become available. As time goes on and things mature, the output from these interfaces will be increasingly informative, easy-to-read, and computable. This means the output will answer questions, be visually appealing, as well as be available in one or more formats for other computer programs to operate upon.
Tell a story. You and your hosting institution(s) have something significant to offer. It is not just about you and your archive but also about libraries, museums, the local municipality, etc. As a whole you are a local geographic entity. You represent something significant with a story to tell. Combine your linked data with the linked data of others in your immediate area. The ontologies will be a total hodgepodge, at least at first. Now provide a search engine against the result. Maybe you begin with local libraries or museums. If you work in an academic setting, then maybe you begin with other academic departments across campus. Allow people to search the interface and bring together the content of everybody involved. Do not just provide lists of links in search results, but instead create knowledge graphs. Supplement the output of search results with the linked data from Wikipedia, Flickr, etc. In a federated search sort of way, supplement the output with content from other data feeds such as (licensed) bibliographic indexes or content harvested from OAI-PMH repositories. Identify complementary content from further afield. Figure out a way for you and they to work together to create a newer, more complete set of content. Creating these sorts of things on-the-fly will be challenging. On the other hand, you might implement something that is more iterative and less immediate, but more thorough and curated if you were to select a topic or theme of interest, and do your own searching and story telling. The result would be something that is at once a Web page, a document designed for printing, or something importable into another computer program.
Create new knowledge. Create an inference engine, turn it against your triple store, and look for relationships between distinct sets of URIs that weren't previously apparent. Here's one way how (a small crawling sketch, in Python, follows the list below):
1. allow the reader to select an actionable URI of personal
interest, ideally a URI from the set of URIs you curate
2. submit it to an HTTP server or SPARQL endpoint and request RDF as
output
3. save the output to a local store
4. for each subject and object URI found in the output, go to
Step #2
5. go to step #2 n times for each newly harvested URI in the store
where n is a reader-defined integer greater than 1; in other
words, harvest more and more URIs, predicates, and literals
based on the previously harvested URIs
6. create a set of human readable services / reports against the
content of the store, and think of these services / reports as
akin to finding aids, reference materials, or museum exhibits of
the future. Example services / reports might include:
* hierarchical lists of all classes and properties - This
would be a sort of semantic map. Each item on the map
would be clickable allowing the reader to read more and
drill down.
* text mining reports - collect into a single "bag of
words" all the literals saved in the store and create:
word clouds, alphabetical lists, concordances,
bibliographies, directories, gazetteers, tabulations of
parts of speech, named entities, sentiment analyses,
topic models, etc.
* maps - use place names and geographic coordinates to
implement a geographic information service
* audio-visual mash-ups - bring together all the media
information and create things like slideshows, movies,
analyses of colors, shapes, patterns, etc.
* search interfaces - implement a search interface
against the result, SPARQL or otherwise
* facts - remember SPARQL queries can return more than
just lists. They can return mathematical results such
as sums, ratios, standard deviations, etc. They can also
return Boolean values helpful in answering yes/no
questions. You could have a set of canned fact queries
such as: how many ontologies are represented in the
store? Is the number of ontologies greater than 3? Are
there more than 100 names represented in this set? The
count of languages used in the set, etc.
7. Allow the reader to identify a new URI of personal interest,
specifically one garnered from the reports generated in Step #6.
8. Go to Step #2, but this time have the inference engine be more
selective by having it try to crawl back to your namespace and
set of locally curated URIs.
9. Return to the reader the URIs identified in Step #7, and by
consequence, these URIs ought to share some of the same
characteristics as the very first URI; you have implemented a
"find more like this one" tool. You, as curator of the collection
of URIs might have thought the relations between the first URI
and set of final URIs was obvious, but those relationships would
not necessarily be obvious to the reader, and therefore new
knowledge would have been created or brought to light.
10. If there are no new URIs from Step #7, then go to Step #6
using the newly harvested content.
11. Done - if a system were created such as the one above, then
the reader would quite likely have acquired some new knowledge,
and this would be especially true the greater the size of n in
Step #5.
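A minimal sketch of the crawling portion of this process, using the Python rdflib library, appears below. The starting URI is a placeholder, the reporting and "crawl back" steps are left out, and a real implementation would want politeness, caching, and better error handling.

  from rdflib import Graph, URIRef

  def crawl(start_uri, n=2):
      "Harvest RDF about start_uri and about the URIs it links to, n rounds deep (Steps #2-#5)."
      store, frontier, seen = Graph(), {start_uri}, set()
      for _ in range(n):
          harvested = set()
          for uri in frontier - seen:
              seen.add(uri)
              try:
                  store.parse(uri)          # content negotiation, then parse the RDF
              except Exception:
                  continue                  # skip URIs that do not resolve to RDF
          for s, p, o in store:             # gather subject & object URIs for the next round
              for node in (s, o):
                  if isinstance(node, URIRef):
                      harvested.add(str(node))
          frontier = harvested
      return store                          # create reports & services against this (Step #6)

  store = crawl("http://example.org/id/person/twain", n=2)
  print(len(store), "triples gathered")
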
Expand beyond the document-centric finding aid model. The archival finding aid (specifically the EAD file) is essentially a document with two parts: 1) a narrative story describing a collection, and 2) an inventory of items in the collection. Yes, these finding aids are manifested as XML files, and therefore they are well-structured and computer-readable, but they do not really take advantage of the "webbed" environment. Finding aids impose a conceptual model on a collection from the point of view of the archivist. This in and of itself is not a bad thing, and it serves many purposes. At the same time, finding aids make it difficult to assert alternative conceptual models on the same collection. Like the MARC records of librarianship, EAD files are an electronic form of a print data model. MARC is to catalog cards as EAD is to finding aids. Each serves a particular purpose, but neither exploits nor really takes advantage of computers connected through HTTP. Migrating archival description from document-centric finding aids to the more atomistic RDF makes it easier for many models to be asserted. It also overcomes a number of other limitations. The following use-cases are gleaned from the original LiAM proposal and can be addressed by manifesting archival description as linked data:
* An individual record or series of records often have
simultaneous significance in multiple contexts. How do you
signify simultaneous records when they might have multiple
contexts? --Anne Sauer
* Record creators have multifaceted relationships to different
records and series of records. How do you model record creators
with different series of records? --Anne Sauer
* Documentation of a function often spans provenance-based