A Hacked Librarian

Law Reviews: Scanning the Backfile

Posted on April 14, 2010 | 7 comments

We have been working for a while now on digitizing the backfile of our school’s law review. After lots of testing and thinking about the subject, we are scanning to multipage TIFF’s, and producing OCR’ed PDF image-on-text files from those TIFF’s. The TIFFs will, of course, be the “official” archival copy, and the PDF will be the main production copy for actual use.

In order to assure that tiny footnotes will be readable when zoomed in on, we are scanning at 600 dpi. Some may think this overkill, but the tiny text needs to be readable. With the 600 dpi, we are getting good results with a bi-tonal scan (black and white, as opposed to grayscale). With these two parameters, we get a nice image that is searchable. We are also using, for obvious reasons, the article as the basic file unit. Given that these are rarely over 50 pages, and usually less, the file sized are not unwieldy, even after OCR and conversion to PDF.

Of course, there is nothing really interesting in this.

What may be interesting is that we have a standard metadata markup that is being embedded in the TIFFS, the PDF files, and loaded in a database for recall and manipulation. We are embedding in RDF in the TIFF comment section, and using XMP to embed in the PDF files.

This may be interesting because even now, a good Dublin Core based metadata set for journal articles is still a little problematic. Even this is not optimal, but I think it is quite useable. Here is a sample of what the RDF looks like:

<?xml version="1.0"?>

<rdf:RDF  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:dcterms="http/://purl.org/dc/terms/"
        xmlns:exif="http://www.w3.org/2003/12/exif/ns">

<rdf:Description rdf:about="URI:FILENAME_HERE">
<dc:title>ARTICLE_TITLE</dc:title>
<dc:creator>AUTHOR</dc:creator>
<dc:publisher>Rutgers University School of Law - Camden</dc:publisher>
<dc:contributor>
        <rdf:Bag>
                <rdf:li>ALT_AUTHORS</rdf:li>
        </rdf:Bag>
</dc:contributor>
<dc:identifier>
        <dterms:bibliographicCitation>BLUEBOOK_CITATION</dterms:bibliographicCitation>
</dc:identifier>
<dc:relation>
        <dterms:isPartof>
               <dcterms:ISSN>44556-trx</dcterms:ISSN>
        </dterms:isPartof>
</dc:relation>
<dc:relation>
        <dcq:isPartof>Rutgers Law Journal</dcq:isPartof>
</dc:relation>
<dc:language>
        <dcterms:RFC1766>EN</dcterms:RFC1766>
</dc:language>
<dc:date>
        <dcq:created>PUBLICATION_DATE</dcq:created>
</dc:date>
<dc:date>
        <dcq:issued>FILE_CREATION_DATE</dcq:issued>
</dc:date>
<dc:format>
        <dcterms:IMT>image/tiff</dcterms:IMT>
</dc:format>
<dc:format>
<dcterms:extent>>480202 bytes</dcterms:extent>
</dc:format>
<exif:compression>Fax</exif:compression>
<exif:imageWidth>3363</exif:imageWidth>
<exif:imageHeight>5415</exif:imageHeight>
<exif:xResolution>600</exif:xResolution>
<exif:yResolution>600</exif:yResolution>
<exif:resolutionUnit>pixels</exif:resolutionUnit>
<dc:format>
        Colors=Bilevel
</dc:format>
 <exif:bitsPerSample>8</exif:bitsPerSample>
<dc:relation>
        <dcq:isPartOf>
              <rdf:value>page 1 of TOTAL_PAGES</rdf:value>
        </dcq:isPartOf>
</dc:relation>
<dc:rights>Copyright, Rutgers University School of Law - Camden, or the Author, all rights reserved</dc:rights>
</rdf:Description>
</rdf:RDF>

Comments and suggestions are most welcome.

Update:

Edit Note: Thanks to Steffan Malmgren for suggestions on rationalizing the metadata structure that is now edited. To be clear, the first draft was created in 2005, when qualified Dublin Core in rdf was subject to recommendations that are now superceded (see http://dublincore.org/documents/dcmes-xml/ and http://dublincore.org/documents/dcq-rdf-xml/). I had not kept up with the fact that things have settled down to the single namespace: dcterms. That is now corrected. In addition, the use of bNodes was suggested to me in the distant past when first attempting to make sense of RDF notation. Ignorant madness.

As far as the technical image data goes, Steffan is, of course, correct to suggest a more suitable namespace, like EXIF. It should also be noted that every scanning software package I’m aware of inserts much of this data already, but encoding schemas, etc. tend to vary. So, some of this data is repetitive, but worth doing anyway.

7 Comments

Posted in Law Journal Markup, RDF-metadata

Harvesting Congressional Documents

Posted on January 21, 2010 | 2 comments

A couple of years ago, we decided we had to wind down our collection of U.S. Congressional documents. Not only do we have no room for any more, we have no room for what we have. Our having too much problem has led to our digitization program, which I will have much about later. Moving forward, however, we still want to maintain a collection of congressional documents, even without the shelf-space. The solution is to download from the GPO and add them to our growing collection of digitized documents.

On one level, the idea for doing this is really simple: they are available on the GPO websites, just download and add them to the repository. I know, why not just wait until law.gov makes everything available in bulk, or until the GPO makes bulk electronic downloads part of the new depository system. If and when that happens, we will be on board. In the meantime, we would like to have a complete a collection of legislative history materials as possible. Without more paper.

As states, the basis idea is really simple: download the documents. Of course, dealing with thousands of documents in a programmatic manner makes it more of a challenge.

Downloading:

The technique we use for this is a slightly more than basic screen-scrape harvesting. But not by much. The theory was to use a PERL script which essentially would perform a search of the FedSys collection, and download the links to material that are in the search results page. This can be done, provided that there are predictable ways in which FedSys presents material, which fortunately, seems to be the case.

As it turns out, the approach with the FedSys material does not involve a search as such, but rather a structured drill-down into the menued browsing options presented in the system. In practice, the programming is similar to canning a search, but is more predictable. It also requires a little more looping in the program.

For reasons that probably have more to do with the way I think than anything else, I found that the easiest way to approach programming a drill-down to documents was by using the congressional committee browse page. It’s the page that is found here: http://www.gpo.gov/fdsys/browse/committeetab.action. The page looks like this:

The nice thing about this page is that we can start drilling down without dealing with javascript or anything else. At least at the start. From here, we can parse out the links for the individual committees. These have very regular structures, as with the Senate Finance Committee: http://www.gpo.gov/fdsys/browse/committee.action?chamber=senate&committee=finance. Of course, we could just program an array of committees of the House and Senate to make things faster, but by relying on this page, we can rely on the GPO to keep the listings of committees up to date. In addition, should the GPO alter their URL/directory structure, one small change in a regular expression will fix the whole thing. Much less work in the long run, and very little added burden to GPO.

What is actually going on in the PERL script at this point is that the wget utility is being opened as a file handle, with the download, when invoked, being sent to STDOUT. The download is started and read as a WHILE loop, which looks for the following pattern:

/http:\/\/www\.gpo\.gov\/fdsys\/browse\/committee\.action\?chamber=(\w)+\&committee=(.*?)\”>/ , where $1 is the chamber and $2 is the committee.

At this point, we have the information we need to take some shortcuts. The link we have grabbed will generate a page with Ajax code. That code will allow expansion of some categories, first by document type (hearings, prints, reports), and then by Congress. The thing to do is to add that information, so we can get to document links. The PERL line is this:

open(GETLIST, “wget \”http://www.gpo.gov/fdsys/browse/committeecong.action?collection= $collec&chamber=$chamb&committee=$comtee&congressplus

=$congno&ycord=0\” -q -O -|”); # all in one line, of course.

Where $collec is the document type (CHRG for hearings, CRPT for reports, or CPRT for prints), $chamb is the previously grabbed House or Senate, $comtee is the previously grabbed committee string, and $congno is the congress for which you want to gather material. Depending on what and how much you want, additional nested loops would be used to cycle through document types and Congresses. The page that will be opened and parsed actually appears like this:

As the above filehandle is read through it’s own while loop, the PDF, Text, and More links can be identified with a regular expression. In our case, what we really want is the “More” link. It creates a little more work, but is very worth it.

This “More” link does not download a document, but links to one last page. This page will contain a link to a ZIP file which contains both text and PDF versions of the document, as well as PREMIS and MODS metadata files. So, we grab the link to the “ more” page, download it to STDOUT like all the previous, and actually save the ZIP file with this:

system(“wget -O $filen -w 3 -nc –random-wait

\”http://www.gpo.gov/fdsys/delivery/getpackage.action?$zip\””);

where $zip is the file name of the ZIP file, and $filen is the filename that we want to save the file as.

A word on politeness to the GPO: most of those who would be inclined to actually do what I’m writing about here already know this, but in order for this sort of thing to work, it must be done in a manner that will not bring down the GPO servers. We’re not doing this to be mean, right? So, the “-w 3 -nc –r andom-wait” switches in that last wget call are very important. The -w 3 and -random-wait insure that the program will wait an average of 3 seconds before downloading. This slows things down, but relieves the potential load that a program like this might put on the remote server. In the case of the previous pages, this is not necessary because they are rather small XHTML files, and are only read once. The ZIP files are often in the megabytes, and, if you are looping throught Congresses and document types, there are thousands of them.

By all means, get them all for your own repository. But be kind.

Next Post: What to do with all these files once you have them.

2 Comments

Posted in Document Harvesting

Embedded Metadata, Part Deux: Authentication

Posted on January 14, 2010 | 2 comments

One of the comments about my last post, by Ms. Amaral, concerned the issue of authenticity of documents which have been altered by the addition of metadata in the way I described. I think it is an important matter, and warrants some discussion.

First off, what is actually meant by “authenticity” ought to be addressed. In a strict sense, anything that is copied from another source is, of course, no longer the authentic item any longer. But that is not how the term is used in digital libraries. An authentic item is something that, in the end, seems to comply with an idea of accuracy in reproducing an original. That idea of accuracy, moreover, is dependent on the information intended to be conveyed.

That may sound marvellously vague , especially in the context of the words “accurate” and “authentic”, but I think it fits our actual use of the word. For example, in the case of, say, a court decision, where the essential information consists of the text and whatever formatting is essential to preserve the context of the text (paragraph structure, block quotes, footnotes, etc.). In this case, a conversion from a word processor format to HTML or XML could easily retain all the necessary characteristics for any reasonable person to allow that it is “authentic”. In fact, in the case of Lexis, Westlaw, and most commercial news databases, this is standard.

In contrast to the above example, a scan of an ancient text, for example, might be considered differently. In a case where the original appearance of the object is what is of interest, an “ authentic” digital copy would need to retain much more data, and could only be achieved with an image of reasonably high resolution and color depth. Such that where the object of debate is whether the “e” in a unique original is really an “o” with a mold stain, you can zoom in and closely compare that “e” with other “e’s”. And so on. In the case of a scan where such is the purpose, but the resolution is insufficient, it is not sufficiently “authentic” for the purpose.

In non-digital library contexts, this notion of authenticity is even more apparent. The example of translations makes this clear. What is an “authentic” translation? In a strict sense, the question is nonsense. In normal parlance, however, we use the term “authentic translation” in a utilitarian sense. There can be many a piece of literature, all “authentic” translations, and all different. One for the general reader, taking more liberties with literal meaning to make the work as readable in the new language as the old. One for the scholar, sacrificing the original’s readability for a more literal meaning. One for the artist, attempting to preserve literary depth of the original with some sacrifice to both readability and literal accuracy. All could be different, but all competent, and each an “authentic” rendition of the original.

Of course, with documents in a digital library, the editorial decisions of the language translator don’t happen. We do, however, make other types of decisions, like converting between text and image formats, etc. Each decision, including the decision to do nothing, has its advantages and disadvantages. But in the end, the practical accuracy of the information is what counts. I suggest that a document is still perfectly authentic even if converted from Microsoft Word to HTML, even if the footnotes look a little different. As long as you don’t change the text itself.

Having said all that, there is still a substantive issue concerning authenticity, but it is not the authenticity itself we are concerned with, but proof of authenticity. This is an issue because the popular methods of establishing authenticity of digital objects involve digital signatures, which do not verify the accuracy of the human interpret-able information, but whether the digital content of the object is identical to the original. Now, this does indeed verify authenticity, but only in a very strict sense which does not comport with practical usage. It is not practical because it does not allow for the inclusion of essential preservation metadata which alters absolutely nothing in the appearance of the object from the original in any way . It excludes the possibility of converting from an original format which may be highly unsuitable for preservation, to another format which would both preserve the information and be suitable for preserving the object. Finally, it excludes the possibility of making useful, non-editorial enhancements, such as hyperlinking of citations.

There really ought to be a better way. Or at least a better way to use digital signatures.

Along the lines of a better way to use digital signatures, I have suggested that a good approach to authentication would involve preserving original digital signatures as part of maintaining an object’s provenance. Any necessary change made to the digital object ought naturally be documented as part of the document’s metadata. Part of that documentation can and ought to include the digital signatures of previous versions. The result is a non-original object, but with the possibility of having rich metadata records, value-added features and durable formats with a verifiable history documenting exactly what has changed. This would have the effect of re-authenticating at the point of any acceptable alteration, as well as providing a record for investigating the source of any discrepancies that may ever be discovered.

I could go on at length about this here, but I have already done that in an entry at the Cornell Legal Information Institute’s VoxPop blog: http://blog.law.cornell.edu/voxpop/2009/05/14/authentication-of-digital-repositories/ (as of 1/14/2010, this article was off-line at Cornell. I have been assured, however, that it will be back up in the very near future.)

2 Comments

Posted in Authentication

Embeded Metadata Please

Posted on January 4, 2010 | 8 comments

I hate to say it, but a rich (and expensive) set of metadata, carefully ingested into a Fedora Repository will not save you. If that is all you are doing, your repository is doomed. Unfortunately, it appears that many repositories are in just that situation.

Recently, at the 2009 Law Via the Internet Conference, Fabio Vitale, of the Department of Information Science at the University of Bologna, presented his work on the Akoma Ntoso project. ¹ This is an XML markup standard for legal materials (primarily legislation) sponsored by the U.N. ² Among the elements of the standard that they stressed as important were the descriptive and technical metadata to be included in the headers of every document. Even more significant is why they thought it was important. They considered the inclusion of descriptive metadata essential in order to facilitate the long-term preservation of the document.

To the extent that there are those who work with digital repositories who find this odd, there is a problem in the field. The Italians are right. For any digital item that is intended for preservation, metadata needs to be included in the data file itself. Here’s the thing: in order to have the best guarantee of long-term stability, the document itself needs to be as stable as possible. To the extent that the document exists only or primarily within another system, it becomes subject to the vulnerabilities of that other system in addition to whatever problems the document in itself may have. For this reason, I am convinced it is best that digital repositories be contructed with pointers to independently existing files. That decision being made (and also if it is not as well), the link between the metadata which identifies a computer file and the file is a real problem. Over time, it must be expected that at some point, the link between the metadata in a repository and the computer data file will be broken.

The solution, of course, is to keep at least a copy of the metadata in the data file. And, it should be there in a way that can be reliably and quickly recalled to rebuild both the repository and the links to the document files.

Now aside from the metadata system used (Dublin Core, etc.) and the content of the data, the data that is inserted needs to meet some standards:

The data must be machine readable and interpretable. In this context, I mean that it should not only be in something like RDF, but that it can be accessed easily using open source software, using open source protocol. Optimally, this would be plain ASCII or UTF-8 text. To the extent that this is not possible, something as openly accessible as possible is needed;
Needless to say, the metadata needs to be insertable and retrievable in some sort of standards compliant way;
The preservation format must be capable of accepting and storing such data in a manner that can easily be machine read.

At Rutgers – Camden, we have been striving to do this for some time with all our online collections. Depending on the format of the material, different methods need to be used, but in each case, the goal is to make each item in our collections capable of identifying itself in a machine readable manner. Each item does so in such a way that when there is a problem with the primary metadata repository, the database can be rebuilt by a program accessing the embedded copy of the metadata.

A good example of what we are doing is in our U.S. Congressional document collection. ³ For each document that is in the collection, there are four sets of files: the original TIFF format images (one image per page), a compressed PDF version of the TIFFs, a single html file that contains the OCR’ed text of all images, and finally, some large PDF’s (150 pages/file) for downloading. The three different formats provide examples of what can, and should, be done with items in a digital repository, and how to do it.

With the TIFF image, we use a technique used by Yves Lafon and Bert Bos of the World Wide Web Consortium (W3C) in an experiment they did with embedding descriptive metadata in photographs. ⁴ In this case, the use of TIFF’s is warranted because it is an open image standard and is the output format of the scanners. Unfortunately, the TIFF format does not have native support for a rich set of descriptive metadata, and at the time, extension of the available tag set is somewhat problematic. The solution of Messrs. Lafon and Bos was to take an RDF record of their own making, and embed the whole thing into the comment tag of the TIFF file. The technique has the advantage that the embedded RDF record can include anything and be of any length. Equally important, since the comment field in a TIFF is stored as plain text, the record can be retrieved by anything that can parse text.

Since we originally settled on using the TIFF comment field, the XMP standard has matured, and tools are finally available that will allow bulk processing of documents while at the same time taking advantage of the extensible nature of the standard. Although the tools now exist to embed our own extended XMP tags into the TIFF images, we will continue as we have been. The main reason is that even though a rich set of metadata can be embedded with well formed XMP tagging, it remains impossible to retrieve those tags absent proper extration software. In addition, as stated above, XMP is subject to Adobe Corporation’s copyright, so it is not an open standard.

In the case of PDF or JPG files, metadata can be inserted into areas of the files, but there is no ASCII field of indeterminate length such as the TIFF comment field. In both cases, there is either a comments or a keywords field defined by default, but these fields take only a small amount of data, which is stored as binary data. They will not accept a rich set metadata along with the accompanying markup, and are not as easily accessible as they should be.

Both jpegs and pdfs do, however, accept custom defined XMP tagging, embedded by a program such as exiftools. ⁵ This doesn’t resolve the issue of ready availability of the metadata, but it does facilitate rich metatagging. Since the data is in binary, exiftools, or some similar program is also needed to extract the data, which is a significant drawback. However, it is a workable solution.

Finally, with our html files, metadata is simply handled as standard formatted META tags. Originally, we were going to create these files as XML, but found that our Swish-e search engine had difficulties parsing XML where there were oddities in the character set. Unfortunately, with OCR’ed text, there is always some amount of oddity. For now, therefore, we use HTML and META tags. At some point, a transition to RDFa is something we would like to do.

1 http://www.akomantoso.org

2 http://www.lawviatheinternet.org/Presentations/VitaliPalmirani.ppt (link to Prof. Vitali’s PowerPoint © presentation).

3 Http://lawlibrary.rutgers.edu/gdoc/search.shtml.

4 http://www.w3.org/TR/photo-rdf/.

5 http://www.sno.phy.queensu.ca/~phil/exiftool/

8 Comments

Posted in Embeded Metadata

First Post: Supreme Court Documents

Posted on November 18, 2009 | 5 comments

As a digital services librarian, I am regularly asked what it is I do. I tell them: “I produce and maintain digital repositories”. The response is: “OK, but what do you actually, you know, do?” For those few who persist, it often becomes like a Lewis Carroll poem, with me spouting on, only to get the question repeated to me again, but without me getting hit. Unless you count: “Oh, OK, whatever.” as a slap of sorts.

As someone who does abstruse things with documents, this kind of thing is to be expected. However, the fact that I get this response from librarians and programmers alike gives me pause. This is useful work, but no one wants to do it. Librarians simply do not want to gain the programming skills that would allow them to exercise their unique skills in new and valuable ways. Programmers are generally averse to getting so involved with having to understand the text they would have to work with. No one seems to like projects where a computer program must do much of the work, but where the expected error levels make human monitoring and intervention a necessary part of any system created. No one has time to parse large bodies of information manually, but since computers can’t do it perfectly, it is dismissed as impossible.

To the extent that librarians (and systems people) do get involved in digital repositories, the tendency seems to be either to settle for very expensive off-the-shelf solutions that typically do a very generic job with generating essential metadata, and which do little or no really useful text processing, or they get involved in small projects that can be completed with a great deal of manual processing and metadata gathering. Again, very expensive, hence the small projects. So, we get big but not robust, or robust but very small.

I have found, however, that with every single project I have completed, that a middle approach tends to work very well. You can get big and pretty robust. In this article, I will describe one small project involving the “bound volumes” of the U.S. Reports that are available for download from the Supreme Court’s website. We have been processing these in order to add to our collection of U.S. Supreme court cases that we originally collected from public.resource.org.

My hope here is to convey a sense of how one can work with digital materials in order to create a robust and useable digital repository with little cost. I will not talk about programming per se, but about what I look for and think about when I write scripts that process material for a digital repository. For those who can program, I hope it may give a sense of what can be accomplished by paying close attention to the material. For those who know the material but do not program, I hope it gives a sense of what can be accomplished by thinking about the material from the point of view of a programmer, and how far you can get with a a little bit of skill and some imagination.

At this point, we still are not quite ready to roll these documents out for general viewing, but an example of the results can be seen here: http://lawlibrary.rutgers.edu/resource.org/US_reports/US/530/530.US.103_1.html.

The scripts that do the processing I describe are available for the asking. All I ask is that you treat them like any good open source project: submit your improvements, etc. Also, please try not to be too snarky when you comment on how much improving my code requires. I know I lack style. I know they are not optimized.

Supreme Court bound volumes are made available by the court at
http://www.supremecourtus.gov/opinions/boundvolumes.html . They consist of single PDF files, one large file per volume. They are a wonderful resource because, among other things, they contain the pagination of the bound volume, which is required for official citation. The problem , however, is that the volumes are not very searchable, the individual cases have no identifying markup, and there is no internal hyperlinking in the documents.

In order to address these issues, and in order to have our own edition of the US reports, we have been downloading these volumes and processing them to provide for good useability. What follows is an account of the steps we went through to prepare these documents for use. Please note that all processing was done on computers running Linux (Fedora 11), and that all software are either utilities included in a standard Linux distribution or easily available freeware.

Step 1: Bursting

After a volume is downloaded, the first thing to do is to split the large PDF file into individual pages. We do this using a freeware utility called pdftk. The developers of pdftk advertise it by saying: “If PDF is electronic paper, pdftk is the stapler, staple remover, hole punch, and binding machine.” Running it against a PDF file with the “burst” command^¹ will result in the creation of a series of one-page PDFs. These one page PDFs are then converted to ASCII text with the pdftotext freeware utility. By default, pdftotext does not attempt to retain any of the original formatting. However, using the -layout switch, basic formatting (tabs, justification, etc.) will be retained. For ease of use, these processes have are easily combined in a simple script.

Step 2: Get the Table of Cases.

At this point, we have individual pages of well formatted ASCII text. The sequential numbering of these files, however, does not match the pagination of the text, and we need to identify which pages each court opinion is on. To do this, we need to identify the table of cases (located near the beginning of the volume), parse the table to gather and match cases with page numbers. Then, we need to identify which file in the sequence of files contains “page 1”, and also identify any exceptions to the regular sequential pagination that may exist in the text. Fortunately, there are enough hints in the text, and regularity the format and text to allow a computer to do this for us. Finally, as we do all this, we will be identifying several pieces of useful metadata concerning each case. At this point, that metadata is kind of by-the-way, but we will save it for later use.

First step is a run through the text looking for the pages that make up the Table of Cases. The most useful facts about the Table are: 1. They use a string of elipses across the page connecting each case entry with its page numbers; 2. The table is at the beginning of each volume, starting and ending within the first few hundred pages of text; and 3. The Table is always succeeded immediately by the first reported case, and 4. the first page of reported cases (page 1), always contains the text “CASES ADJUDGED” (all caps) at the top of the page.

Using a bit of logic, we can make some rules that a computer can follow:

Start searching the files sequentially from the beginning for a page that has several strings of elipses on it. If it is followed by several others, they are all part of the Table of Cases. Note them.
Once you have identified pages that make up the Table, start looking for the phrase “CASES ADJUDGED” at the top of the page. When you see that, you have the whole table, and finally
If you get to around file number 250 and are still getting pages that look like Table pages, quit with an error message.

At this point, the program that is doing the above will generate a list of the files that contain the Table of Cases pages. Also important, it has also identified the file that contains the all important Page 1 (i.e. the first page with arabic numerals, being the pages that will be cited). Before parsing the Table, however, there is a little more that needs to be done. This is because in every volume of the U.S. Reports, there will be one or more breaks in the page sequence to accommodate the insertion of last-minute material. These breaks occur between the opinions and orders, and between orders and any other matter that may be inserted after the orders. These breaks need to be located and the exceptions noted.

Fortunately, the places where these breaks occur are clearly identified with very regular text and formatting, and labeled as a “Reporter’s Note”. Here is an example:

Reporter’s Note

The next page is purposely numbered 1001. The numbers between 875 and 1001 were intentionally omitted, in order to make it possible to publish the orders with permanent page numbers, thus making the official citations available upon publication of the preliminary prints of the United States Reports.

For each occurence of this, there will be nothing else on the page. Given this, we make our parsing program continue searching the rest of the text pages looking for the phrase “Reporter’s Note” and, more significantly, “The next page is purposely numbered xxx.” The program will make a note of the file in which these notes occur, and the new page number specified in the note. Each time this occurs, we must abandon the previous page numbering sequence, and restart with filenumber + 1 = xxx. Also, keep in mind that any case listed after these breaks will be an order, not a full blown opinion. As such, there will usually be several per page, and they are not as significant as full blown decisions on the merits. Therefore, we will note in the metadata any decision listed in the tables that is assigned to these pages.

Step 3: Parse the Table of Cases

Now for parsing the table. Compared with what has already been done, parsing the Table is pretty straightforward. We make the computer go through each page line by line, making a note of lines that have a text, a series of elipses, followed by numbers. Grab the text, which will be the caption of the case, the numbers after the elipses, which are the text’s page numbers,^² and, using the information gathered from counting where page 1 starts, and the later changes, calculate which file that the first page of the listed case. We write this data to a file so it can be reviewed for problems before compiling opinion files. The file includes the caption, beginning page number, and beginning file number for each table of contents entry.

Step 4: Creating files of individual decisions

Creating a single file for each opinion is now a matter of parsing the above data file and appending files. This is done via a script. We are naming the files based on the U.S. Reports citation, derived from the page and volume data. While we are doing the appending, however, we have an opportunity to do something really useful: we enclose the content of each page in a set of <div> tags in the form: <div class=”page” value=X>. This is with a visble label <p class=”pnum”>Page X</p>. With these inclusions, a simple stylesheet can make the decision display with clearly visible pagination that accurately corresponds to the U.S. Reports. It is now a completely citeable document. Finally, the last thing we do with each page is to wrap the text in <pre> tags to preserve the basic formatting that was created when the page was converted from pdf.

At the same time that the documents are being assembled, we also find and assemble some metadata items that will create some specialized search fields and make the document self-identifying. We already have the caption and citation, the URL and md5sum of the original Pdf file. In addition, however, we can parse the first page of each decision, and lift out the docket number and date of decision. These are located in predictable locations near the beginning of those initial pages, surrounded by the kind of stylized formatting and language that make searching for such things pretty reliable. The processing script formats these data items as well-formed HTML meta headers and places them in the <head> section of the file that is being created. Finally, this metadata is also writted to a file that will be proofread for accuracy, then uploaded to a MySql table to be used in our user interface.

As for the table material, the parsed table of contents will, of course, contain a lot of data that will not require the creation of new documents. That is because there are usually a number of cases mentioned per page in these sections. We still want to provide access to this information, so we will create single page documents for each table page, but create still assemple a metadata record for each case entry. This will allow a party name search, etc. to produce results and return the table page. These metadata records will include the notation “TABLE,” which will be useful later during proofreading.

As mentioned above, we are not doing anything with the formatting of the decision itself other than using the -layout switch when converting the pages from pdf to text in pdftotext. This leaves a very readable document, and is and acceptable result. Unfortunately, footnotes do not come out particularly well, but they are still reliably located at the bottoms of each page. It is something, however, that we will be working to improve.

Step 5: Proofreading

At the beginning of this article, I talked about the “middle approach” to document processing that includes human editing as well as automated processing. Now that the opinions and accompanying metadata are assembled, it is time for the human intervention. It is tedious, but the metadata file must be reviewed for errors and omissions. I construct the metadata file as a quote-delimited text file, one record per line. The first thing to do is technically simple, have a text editor search for occurrences of “”. This would indicate that a piece of data was missed. In my experience, missing data is also an indicator of anything else that may have gone wrong, so a review and correction of the empty quotes results in a good dataset. Bringing each full opinion up in a browser and matching it with the table of contents is also recommended. At this point, we manually edit the metadata file to include missing data, and make any necessary edits to the decision files at the same time. These edits will not, of course, touch the text, but will include filling in the missing metadata in the files metatags, and moving any wrongly appended pages to their proper place.

With the processing scripts we have now, this editing was a week-long project (Not a full time week, but during off times at the reference desk, etc.) that accomplished the proofing of ten full volumes of the U.S. Reports. It was tedious. The amount of tedious labor, however, was reduced to a level that I could stand, and which made it possible to complete the project in something like a reasonable amount of time. Future volumes that are released by the court will be processed individually, and will take an hour or two each to process.

Conclusion

this was kept pretty non-technical and therefore not very specific as to the programmingdetails of what was actually done at each stage. The point of this article, however, was to describe the thought that goes into such a project, and to show that the most significant aspects of this are well within the expertise and experience of anyone who is familiar with the U.S. Reports, or legal decisions in general. The programming saves a lot of time by automating the routine, but, in an important sense, it not the key to the project. For those interested, please e-mail me at jjoerg@camden.rutgers.edu and I will send a copy of all the processing scripts. I am sure they can use improvement, but they do work.

Most important is that fact that this hour or two per volume results in a large set of very useful documents. Documents that we own and can make available to the public and our community. They are not fancy, but they are citeable, searchable in full-text, as well as by caption, date of decision, citation, and docket number. They are also fairly small in size and so don’t take up much disk space. So, they are easy and cheap to store. It will certainly not replace Westlaw or Lexis. At least not for the foreseeable future. But, for now, it creates another useful access point for those who don’t have access to those services. It also puts just that tiny bit of pressure on the big services to preserve their quality and keep their prices in line. It also constitutes an instance of actual distribution of critical legal information, the thing that, in the end, it the real, long term, justification for out existence as librarians.

1bash$ pdftk filename burst.

2. It is possible for there to be multiple table entries, or table entries and an opinion, in the same volume.

Continue reading →

5 Comments

Posted in Document Harvesting