Embedded Metadata Please

I hate to say it, but a rich (and expensive) set of metadata, carefully ingested into a Fedora Repository, will not save you. If that is all you are doing, your repository is doomed. Unfortunately, it appears that many repositories are in just that situation.

Recently, at the 2009 Law Via the Internet Conference, Fabio Vitale, of the Department of Information Science at the University of Bologna, presented his work on the Akoma Ntoso project. ¹ This is an XML markup standard for legal materials (primarily legislation) sponsored by the U.N. ² Among the elements of the standard that they stressed as important were the descriptive and technical metadata to be included in the headers of every document. Even more significant is why they thought it was important. They considered the inclusion of descriptive metadata essential in order to facilitate the long-term preservation of the document.

To the extent that there are those who work with digital repositories who find this odd, there is a problem in the field. The Italians are right. For any digital item that is intended for preservation, metadata needs to be included in the data file itself. Here’s the thing: in order to have the best guarantee of long-term stability, the document itself needs to be as stable as possible. To the extent that the document exists only or primarily within another system, it becomes subject to the vulnerabilities of that other system in addition to whatever problems the document itself may have. For this reason, I am convinced it is best that digital repositories be constructed with pointers to independently existing files. Whether or not that design is chosen, the link between the metadata that identifies a computer file and the file itself is a real problem. Over time, it must be expected that the link between the metadata in a repository and the computer data file will at some point be broken.

The solution, of course, is to keep at least a copy of the metadata in the data file. It should be stored there in a way that allows it to be reliably and quickly recalled to rebuild both the repository and the links to the document files.

Now aside from the metadata system used (Dublin Core, etc.) and the content of the data, the data that is inserted needs to meet some standards:

The data must be machine readable and interpretable. In this context, I mean not only that it should be in something like RDF, but also that it can be accessed easily using open source software and open protocols. Optimally, this would be plain ASCII or UTF-8 text. To the extent that this is not possible, something as openly accessible as possible is needed;
Needless to say, the metadata needs to be insertable and retrievable in some sort of standards compliant way;
The preservation format must be capable of accepting and storing such data in a manner that can easily be machine read.

At Rutgers – Camden, we have been striving to do this for some time with all our online collections. Depending on the format of the material, different methods need to be used, but in each case, the goal is to make each item in our collections capable of identifying itself in a machine readable manner. Each item does so in such a way that when there is a problem with the primary metadata repository, the database can be rebuilt by a program accessing the embedded copy of the metadata.
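The rebuild step described above can be sketched roughly as follows. This is only an illustration, not the program we actually run: the function name `rebuild_index`, the table layout, and the idea of passing in one reader function per file extension are all assumptions made for the example. Each reader is expected to take a file path and return a dictionary mapping field names to lists of values.

```python
import os
import sqlite3

def rebuild_index(root, readers, db_path=":memory:"):
    """Walk `root`, read each file's embedded metadata, and store it by path.

    `readers` maps a lowercase file extension (e.g. ".tif") to a function
    that takes a path and returns {field: [values]}. Both the mapping and
    the table layout are hypothetical, chosen only for this sketch.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metadata (path TEXT, field TEXT, value TEXT)"
    )
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in sorted(filenames):
            suffix = os.path.splitext(filename)[1].lower()
            reader = readers.get(suffix)
            if reader is None:
                continue  # no known way to read metadata from this format
            path = os.path.join(dirpath, filename)
            for field, values in reader(path).items():
                conn.executemany(
                    "INSERT INTO metadata VALUES (?, ?, ?)",
                    [(path, field, value) for value in values],
                )
    conn.commit()
    return conn
```

Because the file path is stored next to every value, the same pass restores both the descriptive records and the links from the repository back to the files.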

A good example of what we are doing is in our U.S. Congressional document collection. ³ For each document that is in the collection, there are four sets of files: the original TIFF format images (one image per page), a compressed PDF version of the TIFFs, a single HTML file that contains the OCR’ed text of all images, and finally, some large PDFs (150 pages/file) for downloading. The three different formats provide examples of what can, and should, be done with items in a digital repository, and how to do it.

With the TIFF image, we use a technique used by Yves Lafon and Bert Bos of the World Wide Web Consortium (W3C) in an experiment they did with embedding descriptive metadata in photographs. ⁴ In this case, the use of TIFFs is warranted because TIFF is an open image standard and is the output format of the scanners. Unfortunately, the TIFF format does not have native support for a rich set of descriptive metadata, and at the time, extension of the available tag set was somewhat problematic. The solution of Messrs. Lafon and Bos was to take an RDF record of their own making, and embed the whole thing into the comment tag of the TIFF file. The technique has the advantage that the embedded RDF record can include anything and be of any length. Equally important, since the comment field in a TIFF is stored as plain text, the record can be retrieved by anything that can parse text.
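To illustrate the point that anything able to parse text can recover the record, here is a minimal sketch in Python. It assumes the RDF record sits uncompressed in the file as UTF-8 text, that the record declares the usual `rdf` and `dc` namespace prefixes, and that the fields of interest are Dublin Core elements. The function names are made up for the example, and no TIFF library is involved: the code simply scans the raw bytes.

```python
import re
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Matches the first complete rdf:RDF element anywhere in the byte stream.
RDF_BLOCK = re.compile(rb"<rdf:RDF\b.*?</rdf:RDF>", re.DOTALL)

def extract_rdf(raw):
    """Return the embedded RDF record as text, or None if none is found."""
    match = RDF_BLOCK.search(raw)
    return match.group(0).decode("utf-8") if match else None

def dublin_core_fields(rdf_text):
    """Collect Dublin Core values from every rdf:Description in the record."""
    root = ET.fromstring(rdf_text)
    fields = {}
    for description in root.iter(f"{{{RDF_NS}}}Description"):
        for child in description:
            if child.tag.startswith(f"{{{DC_NS}}}"):
                name = child.tag.split("}", 1)[1]
                fields.setdefault(name, []).append((child.text or "").strip())
    return fields

def read_embedded_metadata(path):
    """Read a file as raw bytes and return its embedded Dublin Core fields."""
    with open(path, "rb") as handle:
        record = extract_rdf(handle.read())
    return dublin_core_fields(record) if record else {}
```

The design choice worth noticing is that the image structure is never interpreted. That is exactly the property that makes a plain-text comment field attractive for preservation: the reader needs only a text search and an XML parser.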

Since we originally settled on using the TIFF comment field, the XMP standard has matured, and tools are finally available that will allow bulk processing of documents while at the same time taking advantage of the extensible nature of the standard. Although the tools now exist to embed our own extended XMP tags into the TIFF images, we will continue as we have been. The main reason is that even though a rich set of metadata can be embedded with well formed XMP tagging, it remains impossible to retrieve those tags absent proper extraction software. In addition, XMP is subject to Adobe Corporation’s copyright, so it is not an open standard.

In the case of PDF or JPEG files, metadata can be inserted into areas of the files, but there is no ASCII field of indeterminate length such as the TIFF comment field. In both cases, there is either a comments or a keywords field defined by default, but these fields take only a small amount of data, which is stored as binary data. They will not accept a rich set of metadata along with the accompanying markup, and are not as easily accessible as they should be.

Both JPEGs and PDFs do, however, accept custom-defined XMP tagging, embedded by a program such as ExifTool. ⁵ This doesn’t resolve the issue of ready availability of the metadata, but it does facilitate rich metatagging. Since the data is in binary, ExifTool or some similar program is also needed to extract the data, which is a significant drawback. However, it is a workable solution.
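As a rough idea of what driving ExifTool from a script might look like, the sketch below builds the command lines in Python. It assumes the `exiftool` executable is installed and on the PATH, and it uses the standard XMP Dublin Core group (`XMP-dc`) rather than custom-defined tags, which would additionally require an ExifTool configuration file. The helper names are invented for the example.

```python
import json
import subprocess

def build_write_command(path, tags):
    """Build an ExifTool command that writes XMP Dublin Core tags to `path`."""
    args = ["exiftool", "-overwrite_original"]
    for name, value in tags.items():
        # e.g. -XMP-dc:Title=Sample Hearing
        args.append(f"-XMP-dc:{name}={value}")
    args.append(path)
    return args

def build_read_command(path):
    """Build an ExifTool command that dumps all XMP tags in `path` as JSON."""
    return ["exiftool", "-j", "-G1", "-XMP:all", path]

def write_xmp(path, tags):
    """Run ExifTool to embed the tags (requires exiftool to be installed)."""
    subprocess.run(build_write_command(path, tags), check=True)

def read_xmp(path):
    """Run ExifTool and return the XMP tags of one file as a dictionary."""
    result = subprocess.run(
        build_read_command(path), capture_output=True, text=True, check=True
    )
    # ExifTool's JSON output is a list with one object per input file.
    return json.loads(result.stdout)[0]
```

Note that both directions depend on the external program, which is the drawback described above.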

Finally, with our HTML files, metadata is simply handled as standard formatted META tags. Originally, we were going to create these files as XML, but found that our Swish-e search engine had difficulties parsing XML where there were oddities in the character set. Unfortunately, with OCR’ed text, there is always some amount of oddity. For now, therefore, we use HTML and META tags. At some point, we would like to make a transition to RDFa.
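Reading META tags back out of an HTML file needs nothing beyond a standard library parser. The following Python sketch is one way it could be done. The class and function names are hypothetical, as are the `DC.title` style field names in the usage example, which are not necessarily the names used in our files.

```python
from html.parser import HTMLParser

class MetaTagCollector(HTMLParser):
    """Collect name/content pairs from META tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name")
        content = attr_map.get("content")
        if name and content is not None:
            self.metadata.setdefault(name, []).append(content)

def read_meta_tags(html_text):
    """Return {name: [content, ...]} for every named META tag in the text."""
    collector = MetaTagCollector()
    collector.feed(html_text)
    collector.close()
    return collector.metadata
```

For example, `read_meta_tags('<meta name="DC.title" content="Sample Hearing">')` yields a dictionary with a single `DC.title` entry. The parser is also forgiving of the character oddities that OCR'ed text tends to contain.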

1 http://www.akomantoso.org

3 http://lawlibrary.rutgers.edu/gdoc/search.shtml

4 http://www.w3.org/TR/photo-rdf/

5 http://www.sno.phy.queensu.ca/~phil/exiftool/

8 responses to “Embedded Metadata Please”

M. Amaral | January 4, 2010 at 7:08 pm

Hello, and thank you for this insightful post. Embedding preservation metadata is an innovative solution to a well-anticipated problem. Since this process will be altering the original file, do you foresee any problems this may cause in your program?
I suppose that a condition for submission to the repository could be made that requires altering the file so that it can include embedded metadata, but I’m sure there would be some outcry that would include the words “file authenticity” or “integrity.” Do you anticipate this being an issue, or does the projected benefit of embedding preservation information trump most objections?
Thanks!
–Megan Amaral
jjoerg41 | January 5, 2010 at 5:26 pm

Thanks for your comment.

About authentication generally, please see: http://blog.law.cornell.edu/voxpop/2009/05/14/authentication-of-digital-repositories/. Unfortunately, it seems to be gone for the moment. I’m told it will be back soon.

In any case, the short answers are “yes”, “yes”, and “yes”. The longer answer requires a new post!
Sal Leon | May 27, 2010 at 11:22 pm

Incredibly great read. Honest.
Ericka Quezada | May 30, 2010 at 2:29 pm

If I had a penny for every time I came here! Superb article.