We have been working for a while now on digitizing the backfile of our school’s law review. After lots of testing and thinking about the subject, we are scanning to multipage TIFFs and producing OCR’ed image-on-text PDF files from those TIFFs. The TIFFs will, of course, be the “official” archival copy, and the PDFs will be the main production copy for actual use.
In order to ensure that tiny footnotes will be readable when zoomed in on, we are scanning at 600 dpi. Some may think this overkill, but the tiny text needs to be readable. At 600 dpi, we are getting good results with a bi-tonal scan (black and white, as opposed to grayscale). With these two parameters, we get a nice image that is searchable. We are also using, for obvious reasons, the article as the basic file unit. Given that articles are rarely over 50 pages, and usually fewer, the file sizes are not unwieldy, even after OCR and conversion to PDF.
Of course, there is nothing really interesting in this.
What may be interesting is that we have a standard metadata markup that is being embedded in the TIFFs and the PDF files, and loaded into a database for recall and manipulation. We are embedding it as RDF in the TIFF comment section, and using XMP to embed it in the PDF files.
This may be interesting because even now, a good Dublin Core based metadata set for journal articles is still a little problematic. Even this one is not optimal, but I think it is quite usable. Here is a sample of what the RDF looks like:
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:dcterms="http://purl.org/dc/terms/"
xmlns:exif="http://www.w3.org/2003/12/exif/ns#">
<rdf:Description rdf:about="URI:FILENAME_HERE">
<dc:title>ARTICLE_TITLE</dc:title>
<dc:creator>AUTHOR</dc:creator>
<dc:publisher>Rutgers University School of Law - Camden</dc:publisher>
<dc:contributor>
<rdf:Bag>
<rdf:li>ALT_AUTHORS</rdf:li>
</rdf:Bag>
</dc:contributor>
<dc:identifier>
<dcterms:bibliographicCitation>BLUEBOOK_CITATION</dcterms:bibliographicCitation>
</dc:identifier>
<dc:relation>
<dcterms:isPartOf>
<dcterms:ISSN>44556-trx</dcterms:ISSN>
</dcterms:isPartOf>
</dc:relation>
<dc:relation>
<dcterms:isPartOf>Rutgers Law Journal</dcterms:isPartOf>
</dc:relation>
<dc:language>
<dcterms:RFC1766>EN</dcterms:RFC1766>
</dc:language>
<dc:date>
<dcterms:created>PUBLICATION_DATE</dcterms:created>
</dc:date>
<dc:date>
<dcterms:issued>FILE_CREATION_DATE</dcterms:issued>
</dc:date>
<dc:format>
<dcterms:IMT>image/tiff</dcterms:IMT>
</dc:format>
<dc:format>
<dcterms:extent>480202 bytes</dcterms:extent>
</dc:format>
<exif:compression>Fax</exif:compression>
<exif:imageWidth>3363</exif:imageWidth>
<exif:imageHeight>5415</exif:imageHeight>
<exif:xResolution>600</exif:xResolution>
<exif:yResolution>600</exif:yResolution>
<exif:resolutionUnit>pixels</exif:resolutionUnit>
<dc:format>
Colors=Bilevel
</dc:format>
<exif:bitsPerSample>8</exif:bitsPerSample>
<dc:relation>
<dcterms:isPartOf>
<rdf:value>page 1 of TOTAL_PAGES</rdf:value>
</dcterms:isPartOf>
</dc:relation>
<dc:rights>Copyright, Rutgers University School of Law - Camden, or the Author, all rights reserved</dc:rights>
</rdf:Description>
</rdf:RDF>
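For the PDF side, the descriptive fields travel inside an XMP packet rather than a bare RDF document. What follows is only a rough sketch of how a few of the fields above might be wrapped, not a copy of our production packet. It assumes the usual XMP conventions for the Dublin Core schema: title and rights as language alternatives, creator as an ordered list, publisher as an unordered list. The value application/pdf for dc:format is my assumption for the PDF copy, and the empty begin attribute is the backward-compatible form that signals UTF-8.

```xml
<?xpacket begin="" id="W5M0MpCehiHzreSz9TCzkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <!-- rdf:about is left empty; the packet describes the file it is embedded in -->
    <rdf:Description rdf:about=""
        xmlns:dc="http://purl.org/dc/elements/1.1/">
      <!-- Title is a language alternative with an x-default entry -->
      <dc:title>
        <rdf:Alt>
          <rdf:li xml:lang="x-default">ARTICLE_TITLE</rdf:li>
        </rdf:Alt>
      </dc:title>
      <!-- Creators are an ordered list, so author order is preserved -->
      <dc:creator>
        <rdf:Seq>
          <rdf:li>AUTHOR</rdf:li>
        </rdf:Seq>
      </dc:creator>
      <!-- Publisher is an unordered list -->
      <dc:publisher>
        <rdf:Bag>
          <rdf:li>Rutgers University School of Law - Camden</rdf:li>
        </rdf:Bag>
      </dc:publisher>
      <!-- Assumed MIME type for the PDF production copy -->
      <dc:format>application/pdf</dc:format>
      <dc:rights>
        <rdf:Alt>
          <rdf:li xml:lang="x-default">Copyright, Rutgers University School of Law - Camden, or the Author, all rights reserved</rdf:li>
        </rdf:Alt>
      </dc:rights>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>
```

The fields that have no slot in the XMP Dublin Core schema (the citation, ISSN, and page-count relations) would need another schema or a custom namespace, which is left out of this sketch.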
Comments and suggestions are most welcome.
Update:
Edit Note: Thanks to Steffan Malmgren for suggestions on rationalizing the metadata structure, which is now edited. To be clear, the first draft was created in 2005, when qualified Dublin Core in RDF was subject to recommendations that are now superseded (see http://dublincore.org/documents/dcmes-xml/ and http://dublincore.org/documents/dcq-rdf-xml/). I had not kept up with the fact that things have settled down to a single namespace: dcterms. That is now corrected. In addition, the use of bNodes was suggested to me in the distant past, when I was first attempting to make sense of RDF notation. Ignorant madness.
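To illustrate what dropping the bNodes means in practice, here is a rough sketch of a record flattened onto dcterms properties, with each value attached directly to the described file. This is one reading of the suggestion, using the same placeholders as the sample above and keeping its mapping of dates; it is not necessarily the exact markup we ended up with. The technical image fields are left out here.

```xml
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="URI:FILENAME_HERE">
    <!-- Each property carries its value directly; no intermediate nodes -->
    <dcterms:title>ARTICLE_TITLE</dcterms:title>
    <dcterms:creator>AUTHOR</dcterms:creator>
    <dcterms:publisher>Rutgers University School of Law - Camden</dcterms:publisher>
    <dcterms:bibliographicCitation>BLUEBOOK_CITATION</dcterms:bibliographicCitation>
    <dcterms:isPartOf>Rutgers Law Journal</dcterms:isPartOf>
    <dcterms:language>EN</dcterms:language>
    <!-- Same date mapping as the sample above -->
    <dcterms:created>PUBLICATION_DATE</dcterms:created>
    <dcterms:issued>FILE_CREATION_DATE</dcterms:issued>
    <dcterms:format>image/tiff</dcterms:format>
    <dcterms:extent>480202 bytes</dcterms:extent>
    <dcterms:rights>Copyright, Rutgers University School of Law - Camden, or the Author, all rights reserved</dcterms:rights>
  </rdf:Description>
</rdf:RDF>
```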
As far as the technical image data goes, Steffan is, of course, correct to suggest a more suitable namespace, like EXIF. It should also be noted that every scanning software package I’m aware of already inserts much of this data, but the encoding schemes and field names tend to vary. So some of this data is redundant, but worth recording anyway.


