Watch this: Webified markup

Terrence A. Brooks

Information School, University of Washington, Seattle, WA 98195

Webify your metadata!

"Webify" signals a cultural shift: stop thinking that you can control the future use of your Web content ('my Web document is a finished product'), and start building Web resources designed to facilitate ad hoc, unpredictable future uses ('my Web document is merely input to somebody else's process').

The challenge has been to design metadata that is easy for the non-technical scholar to construct, is expressive enough to declare the provenance and relationship of resources, and, perhaps most importantly, reflects standard information architecture protocols (i.e., XML) so that conventional application programs (e.g., the many XML editors) can harvest the metadata and re-purpose your Web content.

Eprints DC XML

"XML is extensible, platform-independent, and supports internationalization by being fully Unicode compliant. The fact that XML is a text-based format means that when the need arises, one can read and edit XML documents using standard text-editing tools. This has led to the widespread adoption of XML as the lingua franca of information interchange." (Dare Obasanjo)

eprints (These are your digital resources.)
DC (Eprints DC XML grows out of the legacy Dublin Core set of fifteen descriptive elements)
XML (The Extensible Markup Language architecture leverages all the application software built on the protocols of the W3C. Suddenly we can apply our desktop XML tools to create, harvest and manipulate these metadata. Here's a thought: given that Microsoft Office uses XML as the fundamental file format, why couldn't someone write an application that would permit users of MS Word to save their document as XML with eprints DC XML metadata?)

Example: I'm helping a colleague in the Department of English and Comparative Literature to create a digital version of the 600-year-old poem Piers Plowman. His ambition is to publish his re-edited version of the poem as a dozen XML documents available to the Web community for downloading, re-purposing, or integrating into their own repositories or applications.

The challenge of identity: Open Web scholarship, however, has an identity problem. The XML Piers Plowman fragments will be scattered everywhere: somebody might e-mail one of them to you; you might snag one of them with Google. Their metadata must announce the identity of the fragment, its provenance, its relationship with other fragments, and guide the reader from a part to the whole.

This identity problem is a deep problem. Term collisions can occur not only at the level of content (e.g., you use Piers in the context of a poem, somebody else uses Piers in the context of mooring a boat), but also at the deeper level of structure (e.g., a title element of a poem may not be the same as the title of a nobleman). This means that Webified metadata must make links to explanatory structures that disambiguate both content and structure: namespaces.

Background

Eprints DC XML is emerging from a JISC (Joint Information Systems Committee) group focused on improving search in United Kingdom repositories. To facilitate search across repositories of scholarly documents, however, the group first had to address the problem of metadata. Irregularities in content and structure make harvesting metadata difficult.

Consider some of the metadata of this document that you're reading right now:

<meta name="DC.Creator" content="Brooks, Terrence A." />

This architecture is inherently ambiguous (e.g., what's the last name, first name, etc.?), awkward to harvest (e.g., should I use string methods that assume that a comma will always terminate the last name?) and tricky to match to other records that might record my name as "T.A. Brooks", etc. First-generation Web metadata like this was easy to create, but difficult both conceptually and mechanically to harvest and re-purpose. Cole and Foulonneau (2007: 118) state the importance of using metadata content standards in the practical application of OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting).
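The harvesting difficulty can be made concrete with a small Python sketch. The function name `split_creator` and the second sample string are hypothetical illustrations, not part of any Dublin Core tool; the sketch assumes a harvester that relies on a comma to separate the family name from the given names.

```python
def split_creator(content):
    """Naively split a DC.Creator string, assuming 'Family, Given' order."""
    family, separator, given = content.partition(",")
    if not separator:
        # No comma: the harvester cannot tell which part is the family name.
        return None
    return family.strip(), given.strip()


# The same person recorded in two different ways (illustrative values).
print(split_creator("Brooks, Terrence A."))
print(split_creator("T.A. Brooks"))
```

The first record splits as hoped, but the second yields nothing usable, so the two records cannot be matched without further guesswork.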

The implication for the present example is to structure the metadata as XML and to use content that comes from the Library of Congress Name Authority File. The following fragment of Eprints DC XML clearly indicates the given and family names.

<epdcx:statement
    epdcx:propertyURI="http://xmlns.com/foaf/0.1/givenname">
    <epdcx:valueString>
        Terrence
    </epdcx:valueString>
</epdcx:statement>
<epdcx:statement
    epdcx:propertyURI="http://xmlns.com/foaf/0.1/familyname">
    <epdcx:valueString>
        Brooks
    </epdcx:valueString>
</epdcx:statement>
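As a rough illustration of why this structure is easier to harvest, the following Python sketch reads the two statements with the standard library's ElementTree. The wrapping `epdcx:description` element and the function name `read_name` are assumptions added so that the `epdcx` prefix is declared and the fragment parses on its own.

```python
import xml.etree.ElementTree as ET

EPDCX = "http://purl.org/eprint/epdcx/2006-11-16/"
FOAF = "http://xmlns.com/foaf/0.1/"

# The two statements from the fragment, wrapped in a description element
# (an assumption) so that the namespace prefix is declared.
FRAGMENT = f"""
<epdcx:description xmlns:epdcx="{EPDCX}">
  <epdcx:statement epdcx:propertyURI="{FOAF}givenname">
    <epdcx:valueString> Terrence </epdcx:valueString>
  </epdcx:statement>
  <epdcx:statement epdcx:propertyURI="{FOAF}familyname">
    <epdcx:valueString> Brooks </epdcx:valueString>
  </epdcx:statement>
</epdcx:description>
"""


def read_name(xml_text):
    """Return (given name, family name) from the FOAF name statements."""
    root = ET.fromstring(xml_text)
    values = {}
    for statement in root.iter(f"{{{EPDCX}}}statement"):
        prop = statement.get(f"{{{EPDCX}}}propertyURI")
        text = statement.findtext(f"{{{EPDCX}}}valueString", default="")
        values[prop] = text.strip()
    return values.get(FOAF + "givenname"), values.get(FOAF + "familyname")


print(read_name(FRAGMENT))
```

No string guessing is involved: each part of the name is found by the property URI that labels it.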

Catch-up reading to do right now:

Allinson, Johnston and Powell (January 2007) describe the development of metadata for eprints in recognition of the inadequacy of legacy Dublin Core for "repository developers and aggregator services."

Allinson, Johnston and Powell (23-26 January 2007) describe how the DC XML model can accommodate a FRBR-based (Functional Requirements for Bibliographic Records) model.

Allinson and Powell (2006) summarize the Eprints application profile.

Baker (2007) discusses integrating Web resources automatically and gives syntax examples.

What was the situation for web metadata up to this moment? Not good. Simple Dublin Core elements could be used, but as the example representing my name illustrates, the results could be ambiguous. You could embed Dublin Core elements in HTML or XHTML documents, but these were awkward to process as substitutes for XML documents. RDF (Resource Description Framework) was designed to be built on top of structures such as Dublin Core (Powers, 2003). And OWL (Web Ontology Language), intended for describing ontologies, was designed to be built on top of RDF. A house of cards built on an inadequate foundation?

A first example

Unfortunately, eprints DC XML still looks forbidding to the non-technical! The following fragment shows just the title element of the prologue of Piers Plowman.

<?xml version="1.0"?>
<pp:section xmlns:pp="http://faculty.washington.edu/miceal/piersPlowman/">

<!-- Description Set Element -->
<epdcx:descriptionSet
  xmlns:epdcx="http://purl.org/eprint/epdcx/2006-11-16/">
  <epdcx:description>
    <epdcx:statement epdcx:propertyURI="http://purl.org/dc/elements/1.1/title">
      <epdcx:valueString>Prologue of Piers Plowman</epdcx:valueString>
    </epdcx:statement>
  </epdcx:description>
</epdcx:descriptionSet>

<!-- [More eprints DC XML elements would fit here] -->

<!-- Poem -->
<pp:line>
<!-- [Lines of the poem would fit here] -->
</pp:line>

</pp:section>

Say hello to namespaces

Namespaces address the identity problem of Web scholarship. The seeming complexity of this fragment illustrates one of the realities of sharing information with the world: How do you disambiguate the word 'section' from other uses of this word, define what a 'descriptionSet' is and so on?

xmlns:pp="http://faculty.washington.edu/miceal/piersPlowman/" declares the namespace for the poem's structure, a resource that explains the poem's architecture such as section and line. My colleague would be responsible for this namespace.

xmlns:epdcx="http://purl.org/eprint/epdcx/2006-11-16/" declares the namespace for all the XML elements with the prefix epdcx [i.e., eprints DC XML] such as descriptionSet. Eprints DC XML is responsible for this namespace.

epdcx:propertyURI="http://purl.org/dc/elements/1.1/title" is not a namespace declaration, but it serves to disambiguate what a title statement is. Dublin Core is responsible for this namespace.
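A brief Python sketch may help show what a namespace buys you. It parses two elements that are both called `title`; the nobility namespace URI (`http://example.org/nobility/`) and the helper `local_name` are invented for this illustration, while the Dublin Core URI is the one used above.

```python
import xml.etree.ElementTree as ET

# Two elements with the same local name but different namespaces.
WORK_TITLE = ('<title xmlns="http://purl.org/dc/elements/1.1/">'
              'Piers Plowman</title>')
PEER_TITLE = '<title xmlns="http://example.org/nobility/">Duke</title>'


def local_name(tag):
    """Strip the {namespace} part that ElementTree puts in front of a tag."""
    return tag.rsplit("}", 1)[-1]


work = ET.fromstring(WORK_TITLE)
peer = ET.fromstring(PEER_TITLE)

# ElementTree reports each tag as {namespace URI}local-name.
print(work.tag)
print(peer.tag)
print("same local name:", local_name(work.tag) == local_name(peer.tag))
print("same element type:", work.tag == peer.tag)
```

The local names collide, but the fully qualified names do not, which is the disambiguation that a harvesting program relies on.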

Manipulating Eprints DC XML

The advantage of eprints DC XML is ease of manipulation with desktop XML tools. A styling document (an XSLT document: Extensible Stylesheet Language Transformations) can create an HTML document from the XML fragment above.

[Screen grab: styling of the XML document]

This example illustrates that we can create an HTML document by reaching into the XML document and finding the metadata record of the title of the poem. In short, it's not difficult to target Eprints DC XML metadata.

Desktop XML development tools such as XML Spy and Stylus Studio combine the XML fragment with the following stylesheet to produce an HTML document. The stylesheet looks like this:

<?xml version='1.0'?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:pp="http://faculty.washington.edu/miceal/piersPlowman/"
  xmlns:epdcx="http://purl.org/eprint/epdcx/2006-11-16/">

<xsl:template match="/">
  <html>
  <h1>The title</h1>
  <p>
  <xsl:value-of select="pp:section/epdcx:descriptionSet/epdcx:description/epdcx:statement/epdcx:valueString"></xsl:value-of>
  </p>
  </html>
</xsl:template>
</xsl:stylesheet>

For readers unable to appreciate this code, let me assert that this example illustrates re-use with standard XML editing tools. Future scholars can use desktop XML editors to style diverse new documents from the XML Piers Plowman poem documents.

The same effect can be achieved programmatically using C# and VS.Net. Here is the C# code to select the title:

using System;
using System.Xml;
using System.Xml.XPath;

// Load the XML fragment shown above.
XmlDocument doc = new XmlDocument();
doc.Load("titleExample.xml");
XPathNavigator nav = doc.CreateNavigator();

// Compile the XPath that walks down to the title's valueString.
XPathExpression myXPath = nav.Compile(
    "pp:section/epdcx:descriptionSet/epdcx:description/epdcx:statement/epdcx:valueString");

// Bind the prefixes used in the XPath to their namespace URIs.
XmlNamespaceManager man = new XmlNamespaceManager(nav.NameTable);
man.AddNamespace("pp", "http://faculty.washington.edu/miceal/piersPlowman/");
man.AddNamespace("epdcx", "http://purl.org/eprint/epdcx/2006-11-16/");
myXPath.SetContext(man);

// Print every matching value.
XPathNodeIterator i = nav.Select(myXPath);
while (i.MoveNext())
{
    Console.WriteLine(i.Current.Value);
}

For readers unable to appreciate this code, let me assert that the XML poem fragments can be manipulated programmatically. This means that their metadata and contents can be harvested automatically.
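The same selection can also be sketched in Python using only the standard library, which suggests that the technique is not tied to any one toolset. The embedded document is a closed-off copy of the title fragment, and the names `TITLE_EXAMPLE`, `NAMESPACES` and `select_titles` are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI bindings, playing the role of the XmlNamespaceManager.
NAMESPACES = {
    "pp": "http://faculty.washington.edu/miceal/piersPlowman/",
    "epdcx": "http://purl.org/eprint/epdcx/2006-11-16/",
}

# A well-formed copy of the title fragment (closing tags added).
TITLE_EXAMPLE = """<?xml version="1.0"?>
<pp:section xmlns:pp="http://faculty.washington.edu/miceal/piersPlowman/">
  <epdcx:descriptionSet
      xmlns:epdcx="http://purl.org/eprint/epdcx/2006-11-16/">
    <epdcx:description>
      <epdcx:statement
          epdcx:propertyURI="http://purl.org/dc/elements/1.1/title">
        <epdcx:valueString>Prologue of Piers Plowman</epdcx:valueString>
      </epdcx:statement>
    </epdcx:description>
  </epdcx:descriptionSet>
</pp:section>
"""


def select_titles(xml_text):
    """Follow the same path as the stylesheet, starting at pp:section."""
    root = ET.fromstring(xml_text)  # the root element is pp:section
    path = ("epdcx:descriptionSet/epdcx:description/"
            "epdcx:statement/epdcx:valueString")
    return [(node.text or "").strip()
            for node in root.findall(path, NAMESPACES)]


for title in select_titles(TITLE_EXAMPLE):
    print(title)
```

Because the root element is already `pp:section`, the path here starts one step lower than the XPath used in the stylesheet and the C# code.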

Who will use Eprints DC XML?

The answer is: not everybody, because at this time it is still too technical and there is an uncertain pay-off for the extra labour of creating it.
But there are many Web content providers creating open Web repositories designed for re-use and sharing. Examples: (1) Scholars: I put my papers in open Web space; I want readers to download them to their own repositories; I will use eprints DC XML to ease this process; (2) Scholarly journals online: Information Research (i.e., this journal) puts scholarly papers in open Web space... the reasoning is the same; (3) My school, my university, etc.

Watch for this:

Scholars will establish their own namespaces to define elements of information architecture.
Scholars will use their homepages as disambiguation resources.
Editing tools will be developed that make the creation of metadata easy and a natural part of creating a new document.

Eat my own dog food!

There is nothing so salutary as advocating something and then having to practise what one preaches. The following is candidate Eprints DC XML metadata for the 'Watch this' column you're reading right now. It includes explanatory comments that would not normally be included.

<epdcx:descriptionSet xmlns:epdcx="http://purl.org/eprint/epdcx/2006-11-16/">
<!-- First description -->
<epdcx:description>
    <!-- Description of a person -->
    <epdcx:statement
        epdcx:propertyURI="http://purl.org/dc/elements/1.1/type"
        epdcx:valueURI="http://purl.org/eprint/entityType/Person" />

    <!-- First name -->
    <epdcx:statement
        epdcx:propertyURI="http://xmlns.com/foaf/0.1/givenname">
        <epdcx:valueString>
            Terrence
        </epdcx:valueString>
    </epdcx:statement>

    <!-- Family name -->
    <epdcx:statement
        epdcx:propertyURI="http://xmlns.com/foaf/0.1/familyname">
        <epdcx:valueString>
            Brooks
        </epdcx:valueString>
    </epdcx:statement>

    <!-- The homepage identifies which "Terrence Brooks" this is -->
    <epdcx:statement
        epdcx:propertyURI="http://xmlns.com/foaf/0.1/homepage"
        epdcx:valueURI="http://faculty.washington.edu/tabrooks/" />
</epdcx:description>
<!-- Second description -->
<!-- Example address of this resource -->
<epdcx:description
    epdcx:resourceURI="http://informationr.net/ir/">

    <!-- A scholarly work -->
    <epdcx:statement
        epdcx:propertyURI="http://purl.org/dc/elements/1.1/type"
        epdcx:valueURI="http://purl.org/eprint/entityType/ScholarlyWork" />
    <!-- Title of this column -->
    <epdcx:statement
        epdcx:propertyURI="http://purl.org/dc/elements/1.1/title">
        <epdcx:valueString>
            Watch this: Webified metadata
        </epdcx:valueString>
    </epdcx:statement>
    <!-- Abstract of this column -->
    <epdcx:statement
        epdcx:propertyURI="http://purl.org/dc/terms/abstract">
        <epdcx:valueString>
            Eprints DC XML metadata represent webified metadata.
        </epdcx:valueString>
    </epdcx:statement>
    <!-- LCSH subject heading -->
    <epdcx:statement
        epdcx:propertyURI="http://purl.org/dc/elements/1.1/subject"
        epdcx:vesURI="http://purl.org/dc/terms/LCSH">
        <epdcx:valueString>
            Metadata
        </epdcx:valueString>
    </epdcx:statement>
</epdcx:description>
</epdcx:descriptionSet>
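Hand-writing such a record is error-prone, so it may be worth sketching how a tool could generate it. The following Python example builds the person description with the standard library; the helper names `q`, `add_statement` and `describe_person` are hypothetical, and the output is only one plausible serialization, not an official Eprints DC XML generator.

```python
import xml.etree.ElementTree as ET

EPDCX = "http://purl.org/eprint/epdcx/2006-11-16/"
DC_TYPE = "http://purl.org/dc/elements/1.1/type"
FOAF = "http://xmlns.com/foaf/0.1/"

# Ask ElementTree to write the familiar epdcx prefix when serializing.
ET.register_namespace("epdcx", EPDCX)


def q(local):
    """Qualify a local name with the epdcx namespace."""
    return f"{{{EPDCX}}}{local}"


def add_statement(description, property_uri, value_string=None, value_uri=None):
    """Append one statement carrying either a string value or a URI value."""
    statement = ET.SubElement(description, q("statement"),
                              {q("propertyURI"): property_uri})
    if value_uri is not None:
        statement.set(q("valueURI"), value_uri)
    if value_string is not None:
        ET.SubElement(statement, q("valueString")).text = value_string
    return statement


def describe_person(given, family, homepage):
    """Build a description set holding a single person description."""
    description_set = ET.Element(q("descriptionSet"))
    description = ET.SubElement(description_set, q("description"))
    add_statement(description, DC_TYPE,
                  value_uri="http://purl.org/eprint/entityType/Person")
    add_statement(description, FOAF + "givenname", value_string=given)
    add_statement(description, FOAF + "familyname", value_string=family)
    add_statement(description, FOAF + "homepage", value_uri=homepage)
    return description_set


person = describe_person("Terrence", "Brooks",
                         "http://faculty.washington.edu/tabrooks/")
xml_text = ET.tostring(person, encoding="unicode")
print(xml_text)
```

Because the library writes the tags, the result is well-formed by construction, which is the kind of help an editing tool could give the non-technical scholar.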

What did the dog food taste like?

There are several aspects of Eprints DC XML that I struggled with. For example, I avoided the Creator element, which suggests that I construct my name as Brooks, Terrence. Instead, I constructed my name as an Agent and explicitly recorded the family name and given name. This explicit notation is far preferable from the point of view of post-processing.

What is obvious, however, is that if you are going to place information resources on the open Web, then it is worthwhile to adorn them with webified metadata that facilitates harvesting and re-use.

Implementation notes

Suppose you have an open archive composed of many XHTML documents, each of which has been adorned with Eprints DC XML metadata, and the time has come to harvest metadata because you wish to build an index of your holdings, or display some subset of the documents on a webpage. Now you have to roll up your sleeves and tangle with angle brackets and the other minutiae of varying protocols. The big picture: you're going to treat XHTML documents as XML documents and describe a pathway down into the document to pluck out certain pieces of information.

Here are some strategy notes:

1. The documents must be valid XHTML. Such documents are well formed, a necessary requirement if you are going to treat them as XML documents.

2. Add your (well-formed) Eprints DC XML metadata to the head of these XHTML documents. Of course, once you do this, you can no longer validate your documents because Eprints DC XML is not part of the XHTML protocol. You have, in effect, hybridized your documents.

3. To satisfy an XML editor, which will process your documents, you will have to change the DOCTYPE and HTML declarations to reflect the document's morphing into an XML document. The XHTML declarations

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">

should be changed to

<?xml version="1.0" ?>
<html>

4. Add the epdcx namespace to your XSLT stylesheet. This is required because you will be composing an XPath query into your Eprints DC XML metadata:

xmlns:epdcx="http://purl.org/eprint/epdcx/2006-11-16/"

5. Now I can open this Watch This column with the XML editor, Stylus Studio Professional edition 2006, and retrieve Brooks with this XPath:

<xsl:value-of select="/html/head/epdcx:descriptionSet/epdcx:description
  /epdcx:statement[@epdcx:propertyURI = 'http://xmlns.com/foaf/0.1/familyname']
  /epdcx:valueString"></xsl:value-of>

and I can retrieve the title of the column with this XPath:

<xsl:value-of select="/html/body//h1"></xsl:value-of>

What have I accomplished? I have opened up my documents and exposed their contents for re-use. Webified metadata facilitates the mechanical processing of my documents. Use of authorized content from Name Authorities and Subject Authorities facilitates the linking of documents in my repository to the documents in other repositories. My documents can now be the input to future, ad hoc re-use such as the request: give me all the [sub-unit of the article is ...] abstracts of all the articles written by [given name is ...] Terrence [family name is ...] Brooks that have been indexed as [LCSH subject heading is ...] metadata.
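The ad hoc request above can be sketched as a small Python program over a toy repository. Everything named here is an assumption made for illustration: the helper `make_document` builds simplified hybrid documents (with all statements in a single description), the second author is invented, and `abstracts_by` is just one way such a query might be phrased.

```python
import xml.etree.ElementTree as ET

EPDCX = "http://purl.org/eprint/epdcx/2006-11-16/"
NS = {"epdcx": EPDCX}
GIVEN = "http://xmlns.com/foaf/0.1/givenname"
FAMILY = "http://xmlns.com/foaf/0.1/familyname"
ABSTRACT = "http://purl.org/dc/terms/abstract"
SUBJECT = "http://purl.org/dc/elements/1.1/subject"


def make_document(given, family, subject, abstract):
    """Build a simplified hybrid document for the demonstration."""
    pairs = [(GIVEN, given), (FAMILY, family),
             (SUBJECT, subject), (ABSTRACT, abstract)]
    statements = "".join(
        f'<epdcx:statement epdcx:propertyURI="{uri}">'
        f"<epdcx:valueString>{value}</epdcx:valueString>"
        "</epdcx:statement>"
        for uri, value in pairs
    )
    return (
        '<?xml version="1.0" ?><html><head>'
        f'<epdcx:descriptionSet xmlns:epdcx="{EPDCX}">'
        f"<epdcx:description>{statements}</epdcx:description>"
        "</epdcx:descriptionSet></head>"
        "<body><h1>Example</h1></body></html>"
    )


def harvest(xml_text):
    """Map each propertyURI in the document head to its list of values."""
    root = ET.fromstring(xml_text)
    path = "head/epdcx:descriptionSet/epdcx:description/epdcx:statement"
    found = {}
    for statement in root.findall(path, NS):
        uri = statement.get(f"{{{EPDCX}}}propertyURI")
        text = statement.findtext("epdcx:valueString", default="",
                                  namespaces=NS)
        found.setdefault(uri, []).append(text.strip())
    return found


def abstracts_by(documents, given, family, subject):
    """Return the abstracts of documents matching author and subject."""
    hits = []
    for xml_text in documents:
        record = harvest(xml_text)
        if (given in record.get(GIVEN, [])
                and family in record.get(FAMILY, [])
                and subject in record.get(SUBJECT, [])):
            hits.extend(record.get(ABSTRACT, []))
    return hits


REPOSITORY = [
    make_document("Terrence", "Brooks", "Metadata",
                  "Eprints DC XML metadata represent webified metadata."),
    make_document("Ada", "Example", "Metadata",
                  "A hypothetical second article."),
]

for abstract in abstracts_by(REPOSITORY, "Terrence", "Brooks", "Metadata"):
    print(abstract)
```

A real harvester would also need to cope with documents that separate the person and the work into different descriptions, as the candidate metadata for this column does.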

Date: August, 2007

For further reading

Baker, T. (2007). Data, metadata, and the language of interoperability. Retrieved 3 August, 2007 from https://indico.desy.de/getFile.py/access?contribId=1&resId=0&materialId=slides&confId=261

Allinson, J., Johnston, P. and Powell, A. (2007). A Dublin Core application profile for scholarly works. Ariadne, No. 50. Retrieved 3 August, 2007 from http://www.ariadne.ac.uk/issue50/allinson-et-al/

Allinson, J., Johnston, P. and Powell, A. (2007). The Eprints application profile: a FRBR approach to modelling repository metadata. Paper delivered to the 2nd International Conference on Open Repositories, San Antonio, Texas, USA, 23-26 January 2007. Retrieved 3 August, 2007 from http://openrepositories.org/2007/program/files/6/allinson.pdf

Allinson, J. and Powell, A. (2006). Eprints application profile. Paper delivered to Open Scholarship 2006, University of Glasgow, Wednesday Oct 18th. Retrieved 3 August, 2007 from http://www.ukoln.ac.uk/ukoln/staff/j.allinson/eprints-ap-openscholarship.pdf

Cole, T.W. and Foulonneau, M. (2007). Using the Open Archives Initiative Protocol for Metadata Harvesting. Westport, CT: Libraries Unlimited.

Eprints DC XML wiki at http://www.ukoln.ac.uk/repositories/digirep/index/Eprints_DC_XML

Powers, S. (2003). Practical RDF. Sebastopol, CA: O'Reilly.

Vol. 12 No. 4, October 2007