We have been working for a while now on digitizing the backfile of our school’s law review. After lots of testing and thinking about the subject, we are scanning to multipage TIFFs, and producing OCR’ed PDF image-on-text files from those TIFFs. The TIFFs will, of course, be the “official” archival copy, and the PDF will be the main production copy for actual use.
To ensure that tiny footnotes will be readable when zoomed in on, we are scanning at 600 dpi. Some may think this overkill, but the tiny text needs to be readable. At 600 dpi, we are getting good results with a bi-tonal scan (black and white, as opposed to grayscale). With these two parameters, we get a nice image that is searchable. We are also using, for obvious reasons, the article as the basic file unit. Given that these are rarely over 50 pages, and usually fewer, the file sizes are not unwieldy, even after OCR and conversion to PDF.
Of course, there is nothing really interesting in this.
What may be interesting is that we have a standard metadata markup that is being embedded in the TIFFs and the PDF files, and loaded into a database for recall and manipulation. We are embedding RDF in the TIFF comment section, and using XMP to embed it in the PDF files.
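For readers unfamiliar with XMP, the fragment below sketches roughly how a Dublin Core description sits inside an XMP packet. It is an illustration under stated assumptions, not our production output: the values are placeholders, only a few properties are shown, and the container types (rdf:Alt for the title, rdf:Seq for creators, rdf:Bag for the publisher) follow the Dublin Core schema as described in the XMP specification.

```xml
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <!-- begin is normally the Unicode character U+FEFF; an empty value is also accepted and implies UTF-8 -->
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
    <!-- rdf:about is conventionally left empty in XMP, since the packet describes the file it is embedded in -->
    <rdf:Description rdf:about="">
      <dc:title>
        <rdf:Alt>
          <rdf:li xml:lang="x-default">ARTICLE_TITLE</rdf:li>
        </rdf:Alt>
      </dc:title>
      <dc:creator>
        <rdf:Seq>
          <rdf:li>AUTHOR</rdf:li>
        </rdf:Seq>
      </dc:creator>
      <dc:publisher>
        <rdf:Bag>
          <rdf:li>Rutgers University School of Law - Camden</rdf:li>
        </rdf:Bag>
      </dc:publisher>
      <dc:format>application/pdf</dc:format>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>
```

XMP expects the simple Dublin Core element set in these container forms, so the description embedded in the PDF will not be a byte-for-byte copy of the RDF placed in the TIFF comment.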
This may be interesting because even now, a good Dublin Core based metadata set for journal articles is still a little problematic. Even this is not optimal, but I think it is quite usable. Here is a sample of what the RDF looks like:
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:dcterms="http://purl.org/dc/terms/"
         xmlns:exif="http://www.w3.org/2003/12/exif/ns#">
  <rdf:Description rdf:about="URI:FILENAME_HERE">
    <dc:title>ARTICLE_TITLE</dc:title>
    <dc:creator>AUTHOR</dc:creator>
    <dc:publisher>Rutgers University School of Law - Camden</dc:publisher>
    <dc:contributor>
      <rdf:Bag>
        <rdf:li>ALT_AUTHORS</rdf:li>
      </rdf:Bag>
    </dc:contributor>
    <dc:identifier>
      <dcterms:bibliographicCitation>BLUEBOOK_CITATION</dcterms:bibliographicCitation>
    </dc:identifier>
    <dc:relation>
      <dcterms:isPartOf>
        <dcterms:ISSN>44556-trx</dcterms:ISSN>
      </dcterms:isPartOf>
    </dc:relation>
    <dc:relation>
      <dcq:isPartOf>Rutgers Law Journal</dcq:isPartOf>
    </dc:relation>
    <dc:language>
      <dcterms:RFC1766>EN</dcterms:RFC1766>
    </dc:language>
    <dc:date>
      <dcq:created>PUBLICATION_DATE</dcq:created>
    </dc:date>
    <dc:date>
      <dcq:issued>FILE_CREATION_DATE</dcq:issued>
    </dc:date>
    <dc:format>
      <dcterms:IMT>image/tiff</dcterms:IMT>
    </dc:format>
    <dc:format>
      <dcterms:extent>480202 bytes</dcterms:extent>
    </dc:format>
    <exif:compression>Fax</exif:compression>
    <exif:imageWidth>3363</exif:imageWidth>
    <exif:imageHeight>5415</exif:imageHeight>
    <exif:xResolution>600</exif:xResolution>
    <exif:yResolution>600</exif:yResolution>
    <exif:resolutionUnit>pixels</exif:resolutionUnit>
    <dc:format>Colors=Bilevel</dc:format>
    <exif:bitsPerSample>8</exif:bitsPerSample>
    <dc:relation>
      <dcq:isPartOf>
        <rdf:value>page 1 of TOTAL_PAGES</rdf:value>
      </dcq:isPartOf>
    </dc:relation>
    <dc:rights>Copyright, Rutgers University School of Law - Camden, or the Author, all rights reserved</dc:rights>
  </rdf:Description>
</rdf:RDF>
Comments and suggestions are most welcome.
Update:
Edit Note: Thanks to Steffan Malmgren for suggestions on rationalizing the metadata structure, which has now been edited. To be clear, the first draft was created in 2005, when qualified Dublin Core in RDF was subject to recommendations that are now superseded (see http://dublincore.org/documents/dcmes-xml/ and http://dublincore.org/documents/dcq-rdf-xml/). I had not kept up with the fact that things have settled down to the single namespace: dcterms. That is now corrected. In addition, the use of bNodes was suggested to me in the distant past when I was first attempting to make sense of RDF notation. Ignorant madness.
As far as the technical image data goes, Steffan is, of course, correct to suggest a more suitable namespace, like EXIF. It should also be noted that every scanning software package I’m aware of inserts much of this data already, but encoding schemas, etc. tend to vary. So, some of this data is repetitive, but worth doing anyway.
Great use of RDF (and also great that you are digitizing this historic material)! A few things come to mind when reading the graph:
1) Why four different Dublin Core namespaces? Doesn’t dcterms contain all terms needed for expressing this data?
2) The use of BNodes (e.g. the dc:relation reference to a bnode that in turn has a dcq:isPartOf reference to the actual value) seems wrong. Aren’t you stating that the article has a relation with an unspecified resource which in turn is part of Rutgers Law Journal? Wouldn’t it be simpler (and arguably more correct) to state that the article dct:isPartOf “Rutgers Law Journal”?
3) Are you describing the article or the scanned file? To me, they seem to be different resources (one being an Expression, one being a Manifestation, to use FRBR terms). The article has properties like authors, issued date and so on, while the scanned file has properties like image size and density.
Lastly, for representing attributes of the scanned file, have you considered using vocabularies other than the Dublin Core ones, such as the NEPOMUK EXIF ontology (http://www.semanticdesktop.org/ontologies/2007/05/10/nexif/#)?
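One way to read points 1 to 3 together is sketched below: a single dcterms namespace, no intermediate blank nodes, and separate descriptions for the article and the scanned file. This is only an illustration. The example.org URIs and ARTICLE_ID are hypothetical placeholders, the literal values reuse the placeholders from the post, and linking the two resources with dcterms:hasFormat and dcterms:isFormatOf is just one possible choice.

```xml
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/">

  <!-- The article itself: bibliographic properties only -->
  <rdf:Description rdf:about="http://example.org/article/ARTICLE_ID">
    <dcterms:title>ARTICLE_TITLE</dcterms:title>
    <dcterms:creator>AUTHOR</dcterms:creator>
    <dcterms:isPartOf>Rutgers Law Journal</dcterms:isPartOf>
    <dcterms:bibliographicCitation>BLUEBOOK_CITATION</dcterms:bibliographicCitation>
    <dcterms:issued>PUBLICATION_DATE</dcterms:issued>
    <dcterms:hasFormat rdf:resource="http://example.org/scan/FILENAME_HERE"/>
  </rdf:Description>

  <!-- The scanned file: properties of the TIFF, linked back to the article -->
  <rdf:Description rdf:about="http://example.org/scan/FILENAME_HERE">
    <dcterms:isFormatOf rdf:resource="http://example.org/article/ARTICLE_ID"/>
    <dcterms:format>image/tiff</dcterms:format>
    <dcterms:extent>480202 bytes</dcterms:extent>
    <dcterms:created>FILE_CREATION_DATE</dcterms:created>
  </rdf:Description>
</rdf:RDF>
```

Strictly speaking, several dcterms properties (creator, format and extent among them) are defined with non-literal ranges, so a stricter version would point at resources rather than plain strings. Literals are used here only to keep the sketch short.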
The above RDF is indeed problematic.
For example, it says that the format is a resource of *type* dcterms:extent, with rdf:value “480202 bytes”.
The namespace http://dublincore.org/2004/09/20/dcq is also not part of any DC specification (though it’s luckily not used).
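The extent example can be made concrete by putting the two forms side by side. In RDF/XML, elements alternate between nodes and properties, so an element nested directly inside a property element is read as a new node whose type is the element's name. The fragment below is a sketch that reuses the value and subject URI from the sample, and it assumes the file itself is the intended subject of the size statement.

```xml
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="URI:FILENAME_HERE">
    <!-- Problematic: the nested element is parsed as a node of type dcterms:extent
         <dc:format>
           <dcterms:extent>480202 bytes</dcterms:extent>
         </dc:format>
    -->
    <!-- Direct: size and media type are stated as properties of the file itself -->
    <dcterms:extent>480202 bytes</dcterms:extent>
    <dcterms:format>image/tiff</dcterms:format>
  </rdf:Description>
</rdf:RDF>
```

The same reasoning applies to the other nested pairs in the sample, such as dc:date wrapping a created or issued element.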