As a digital services librarian, I am regularly asked what it is I do. I tell them: “I produce and maintain digital repositories”. The response is: “OK, but what do you actually, you know, do?” For those few who persist, it often becomes like a Lewis Carroll poem, with me spouting on, only to get the question repeated to me again, but without me getting hit. Unless you count: “Oh, OK, whatever.” as a slap of sorts.
As someone who does abstruse things with documents, I expect this kind of thing. However, the fact that I get this response from librarians and programmers alike gives me pause. This is useful work, but no one wants to do it. Librarians simply do not want to gain the programming skills that would allow them to exercise their unique skills in new and valuable ways. Programmers are generally averse to getting involved in understanding the text they would have to work with. No one seems to like projects where a computer program must do much of the work, but where the expected error levels make human monitoring and intervention a necessary part of any system created. No one has time to parse large bodies of information manually, but since computers can’t do it perfectly, it is dismissed as impossible.
To the extent that librarians (and systems people) do get involved in digital repositories, the tendency seems to be either to settle for very expensive off-the-shelf solutions that typically do a very generic job of generating essential metadata and do little or no really useful text processing, or to get involved in small projects that can be completed with a great deal of manual processing and metadata gathering. Again, very expensive, hence the small projects. So, we get big but not robust, or robust but very small.
I have found, however, with every single project I have completed, that a middle approach tends to work very well. You can get big and pretty robust. In this article, I will describe one small project involving the “bound volumes” of the U.S. Reports that are available for download from the Supreme Court’s website. We have been processing these in order to add to our collection of U.S. Supreme Court cases that we originally collected from public.resource.org.
My hope here is to convey a sense of how one can work with digital materials in order to create a robust and useable digital repository with little cost. I will not talk about programming per se, but about what I look for and think about when I write scripts that process material for a digital repository. For those who can program, I hope it may give a sense of what can be accomplished by paying close attention to the material. For those who know the material but do not program, I hope it gives a sense of what can be accomplished by thinking about the material from the point of view of a programmer, and how far you can get with a little bit of skill and some imagination.
At this point, we still are not quite ready to roll these documents out for general viewing, but an example of the results can be seen here: http://lawlibrary.rutgers.edu/resource.org/US_reports/US/530/530.US.103_1.html.
The scripts that do the processing I describe are available for the asking. All I ask is that you treat them like any good open source project: submit your improvements, etc. Also, please try not to be too snarky when you comment on how much improving my code requires. I know I lack style. I know they are not optimized.
Supreme Court bound volumes are made available by the court at http://www.supremecourtus.gov/opinions/boundvolumes.html. They consist of single PDF files, one large file per volume. They are a wonderful resource because, among other things, they contain the pagination of the bound volume, which is required for official citation. The problem, however, is that the volumes are not very searchable, the individual cases have no identifying markup, and there is no internal hyperlinking in the documents.
In order to address these issues, and in order to have our own edition of the U.S. Reports, we have been downloading these volumes and processing them to provide for good usability. What follows is an account of the steps we went through to prepare these documents for use. Please note that all processing was done on computers running Linux (Fedora 11), and that all of the software is either a utility included in a standard Linux distribution or easily available freeware.
Step 1: Bursting
After a volume is downloaded, the first thing to do is to split the large PDF file into individual pages. We do this using a freeware utility called pdftk. The developers of pdftk advertise it by saying: “If PDF is electronic paper, pdftk is the stapler, staple remover, hole punch, and binding machine.” Running it against a PDF file with the “burst” command (see note 1) will result in the creation of a series of one-page PDFs. These one-page PDFs are then converted to ASCII text with the pdftotext freeware utility. By default, pdftotext does not attempt to retain any of the original formatting. However, using the -layout switch, basic formatting (tabs, justification, etc.) will be retained. For ease of use, these processes are easily combined in a simple script.
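To give a sense of how small such a script can be, here is a minimal Python sketch of the burst-and-convert step. It is not the script we actually use; the function names and the pg_%04d.pdf file naming are assumptions made for illustration, and it presumes that pdftk and pdftotext are installed and on the path.

```python
import subprocess
from pathlib import Path


def burst_command(volume_pdf, work_dir):
    """Build the pdftk call that splits a volume into one-page PDFs.

    The pg_%04d.pdf naming pattern is an assumption for this sketch.
    """
    pattern = str(Path(work_dir) / "pg_%04d.pdf")
    return ["pdftk", str(volume_pdf), "burst", "output", pattern]


def text_command(page_pdf):
    """Build the pdftotext call for one page, keeping the layout."""
    page_pdf = Path(page_pdf)
    return ["pdftotext", "-layout", str(page_pdf),
            str(page_pdf.with_suffix(".txt"))]


def burst_and_convert(volume_pdf, work_dir):
    """Split the volume, then convert every page to a text file."""
    Path(work_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(burst_command(volume_pdf, work_dir), check=True)
    for page_pdf in sorted(Path(work_dir).glob("pg_*.pdf")):
        subprocess.run(text_command(page_pdf), check=True)
```

Calling burst_and_convert("530.pdf", "work") would leave one .pdf and one .txt file per page in the work directory, numbered in file order rather than by printed page number.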
Step 2: Get the Table of Cases.
At this point, we have individual pages of well formatted ASCII text. The sequential numbering of these files, however, does not match the pagination of the text, and we need to identify which pages each court opinion is on. To do this, we need to identify the table of cases (located near the beginning of the volume) and parse it to gather cases and match them with page numbers. Then, we need to identify which file in the sequence of files contains “page 1”, and also identify any exceptions to the regular sequential pagination that may exist in the text. Fortunately, there are enough hints in the text, and enough regularity in the format and text, to allow a computer to do this for us. Finally, as we do all this, we will be identifying several pieces of useful metadata concerning each case. At this point, that metadata is kind of by-the-way, but we will save it for later use.
The first step is a run through the text looking for the pages that make up the Table of Cases. The most useful facts about the Table are: 1. it uses a string of ellipses across the page connecting each case entry with its page numbers; 2. it is at the beginning of each volume, starting and ending within the first few hundred pages of text; 3. it is always succeeded immediately by the first reported case; and 4. the first page of reported cases (page 1) always contains the text “CASES ADJUDGED” (all caps) at the top of the page.
Using a bit of logic, we can make some rules that a computer can follow:
- Start searching the files sequentially from the beginning for a page that has several strings of ellipses on it. If it is followed by several others, they are all part of the Table of Cases. Note them.
- Once you have identified pages that make up the Table, start looking for the phrase “CASES ADJUDGED” at the top of the page. When you see that, you have the whole table; and finally
- If you get to around file number 250 and are still getting pages that look like Table pages, quit with an error message.
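The rules above can be sketched in Python roughly as follows. This is an illustration, not the script we run: the dot-leader pattern, the threshold of five leader lines for "several", and the function names are all assumptions, and the sketch simplifies the last rule by quitting if page 1 has not turned up by file 250.

```python
import re

# Assumed shape of a dot leader after pdftotext -layout: a run of
# periods, optionally separated by single spaces, ending in a number.
DOT_LEADER = re.compile(r"(?:\.\s?){5,}\s*\d")
MIN_LEADER_LINES = 5   # what counts as "several" is an assumption
MAX_FILES = 250        # give up around file number 250


def looks_like_table_page(text):
    """True if enough lines on the page carry a dot leader."""
    hits = sum(1 for line in text.splitlines() if DOT_LEADER.search(line))
    return hits >= MIN_LEADER_LINES


def starts_reported_cases(text, top_lines=5):
    """True if CASES ADJUDGED appears among the first non-blank lines."""
    top = [line for line in text.splitlines() if line.strip()][:top_lines]
    return any("CASES ADJUDGED" in line for line in top)


def find_table_of_cases(pages):
    """Return (table_file_indexes, page_one_file_index).

    pages is a list of page texts in file order.
    """
    table = []
    for i, text in enumerate(pages):
        if table and starts_reported_cases(text):
            return table, i
        if i >= MAX_FILES:
            raise RuntimeError("no CASES ADJUDGED page found by file %d" % i)
        if looks_like_table_page(text):
            table.append(i)
    raise RuntimeError("ran out of pages before finding CASES ADJUDGED")
```

The error exits matter as much as the happy path: a volume that trips them is one that a person needs to look at.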
At this point, the program doing the above will have generated a list of the files that contain the Table of Cases pages. Just as important, it has identified the file that contains the all-important Page 1 (i.e. the first page with arabic numerals, these being the pages that will be cited). Before parsing the Table, however, there is a little more that needs to be done. This is because in every volume of the U.S. Reports, there will be one or more breaks in the page sequence to accommodate the insertion of last-minute material. These breaks occur between the opinions and orders, and between orders and any other matter that may be inserted after the orders. These breaks need to be located and the exceptions noted.
Fortunately, the places where these breaks occur are clearly identified with very regular text and formatting, and labeled as a “Reporter’s Note”. Here is an example:
The next page is purposely numbered 1001. The numbers between 875 and 1001 were intentionally omitted, in order to make it possible to publish the orders with permanent page numbers, thus making the official citations available upon publication of the preliminary prints of the United States Reports.
For each occurrence of this, there will be nothing else on the page. Given this, we make our parsing program continue searching the rest of the text pages looking for the phrase “Reporter’s Note” and, more significantly, “The next page is purposely numbered xxx.” The program will make a note of the file in which these notes occur, and the new page number specified in the note. Each time this occurs, we must abandon the previous page numbering sequence and restart it so that the file following the note (filenumber + 1) is page xxx. Also, keep in mind that any case listed after these breaks will be an order, not a full-blown opinion. As such, there will usually be several per page, and they are not as significant as full-blown decisions on the merits. Therefore, we will note in the metadata any decision listed in the tables that is assigned to these pages.
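As an illustration of the renumbering, here is a short Python sketch that finds the notes and builds a lookup from printed page number to file index. The function names are hypothetical, and the sketch assumes the note page itself carries no page number of its own.

```python
import re

NEXT_PAGE = re.compile(r"The next page is purposely numbered\s+(\d+)")


def find_page_breaks(pages):
    """Return [(file_index_of_note, new_page_number), ...]."""
    breaks = []
    for i, text in enumerate(pages):
        flat = " ".join(text.split())   # undo line wrapping in the note
        match = NEXT_PAGE.search(flat)
        if match:
            breaks.append((i, int(match.group(1))))
    return breaks


def build_page_map(page_one_index, breaks, n_files):
    """Map each printed page number to the file index that holds it."""
    restart = {note + 1: new_page for note, new_page in breaks}
    note_files = {note for note, _ in breaks}
    page_map = {}
    page = 1
    for i in range(page_one_index, n_files):
        if i in restart:
            page = restart[i]       # the file after a note starts anew
        if i in note_files:
            continue                # assumed: note pages are unnumbered
        page_map[page] = i
        page += 1
    return page_map
```

With page 1 in file 0 and a note in file 2 announcing page 1001, the map sends pages 1 and 2 to files 0 and 1, and page 1001 to file 3.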
Step 3: Parse the Table of Cases
Now for parsing the table. Compared with what has already been done, parsing the Table is pretty straightforward. We make the computer go through each page line by line, making a note of lines that have text, then a series of ellipses, followed by numbers. We grab the text, which will be the caption of the case, and the numbers after the ellipses, which are the case’s page numbers (see note 2), and, using the information gathered from counting where page 1 starts, and the later changes, calculate which file contains the first page of the listed case. We write this data to a file so it can be reviewed for problems before compiling opinion files. The file includes the caption, beginning page number, and beginning file number for each table of contents entry.
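A rough Python sketch of the line-by-line parse is below. The regular expression is an assumption about what the dot leaders look like after conversion, the case names in the example are made up, and the sketch ignores captions that wrap onto a second line, which a real script has to handle. The page_map argument is assumed to be a dictionary from printed page number to file number, however it was built.

```python
import re

# Assumed entry shape: caption, a run of dots, then one or more
# comma-separated page numbers at the end of the line.
ENTRY = re.compile(
    r"^\s*(?P<caption>.*?\S)\s*(?:\.\s?){3,}\s*"
    r"(?P<pages>\d+(?:\s*,\s*\d+)*)\s*$"
)


def parse_table_page(text):
    """Return [(caption, [page, ...]), ...] for one table page."""
    entries = []
    for line in text.splitlines():
        match = ENTRY.match(line)
        if match:
            pages = [int(p) for p in re.findall(r"\d+", match.group("pages"))]
            entries.append((match.group("caption"), pages))
    return entries


def locate_entries(entries, page_map):
    """Return (caption, page, file_index) rows for later review.

    file_index is None when the page is missing from page_map, which
    flags the row for a human to check.
    """
    rows = []
    for caption, pages in entries:
        for page in pages:
            rows.append((caption, page, page_map.get(page)))
    return rows
```

Writing the rows out to a file before doing anything else with them gives a convenient place to catch a bad page offset early.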
Step 4: Creating files of individual decisions
Creating a single file for each opinion is now a matter of parsing the above data file and appending files. This is done via a script. We are naming the files based on the U.S. Reports citation, derived from the page and volume data. While we are doing the appending, however, we have an opportunity to do something really useful: we enclose the content of each page in a set of <div> tags in the form: <div class="page" value=X>. This is accompanied by a visible label, <p class="pnum">Page X</p>. With these inclusions, a simple stylesheet can make the decision display with clearly visible pagination that accurately corresponds to the U.S. Reports. It is now a completely citeable document. Finally, the last thing we do with each page is to wrap the text in <pre> tags to preserve the basic formatting that was created when the page was converted from PDF.
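The page wrapping might look something like this in Python. It is a sketch only: the function names are hypothetical, the value attribute is quoted here for well-formed markup, and escaping the page text so that stray < and & characters cannot break the HTML is an addition of this example rather than something described above.

```python
from html import escape


def wrap_page(page_number, text):
    """Wrap one page of text in the div, label, and pre markup."""
    return (
        '<div class="page" value="%s">\n' % page_number
        + '<p class="pnum">Page %s</p>\n' % page_number
        + "<pre>%s</pre>\n" % escape(text, quote=False)
        + "</div>\n"
    )


def assemble_opinion(pages):
    """Join wrapped pages; pages is a list of (page_number, text)."""
    return "".join(wrap_page(number, text) for number, text in pages)
```

A stylesheet rule on the pnum class is then enough to make each page break stand out in the browser.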
At the same time that the documents are being assembled, we also find and assemble some metadata items that will create some specialized search fields and make the document self-identifying. We already have the caption and citation, the URL and md5sum of the original PDF file. In addition, however, we can parse the first page of each decision, and lift out the docket number and date of decision. These are located in predictable locations near the beginning of those initial pages, surrounded by the kind of stylized formatting and language that make searching for such things pretty reliable. The processing script formats these data items as well-formed HTML meta headers and places them in the <head> section of the file that is being created. Finally, this metadata is also written to a file that will be proofread for accuracy, then uploaded to a MySQL table to be used in our user interface.
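Here is one way the docket number and decision date could be lifted out, sketched in Python. The patterns assume a first-page line of the general form "No. 99-1234. Argued ... Decided May 15, 2000"; that form, the meta tag names, and the function names are assumptions for illustration, and real pages will need more forgiving patterns.

```python
import re
from html import escape

# Assumed patterns; the hyphen in a docket number may arrive as an
# en dash (\u2013) after PDF conversion.
DOCKET = re.compile(r"No\.\s+(\d+[-\u2013]\d+)")
DECIDED = re.compile(r"Decided\s+([A-Z][a-z]+ \d{1,2}, \d{4})")


def extract_metadata(first_page_text):
    """Return docket and decision date; empty string when not found."""
    docket = DOCKET.search(first_page_text)
    decided = DECIDED.search(first_page_text)
    return {
        "docket": docket.group(1) if docket else "",
        "decided": decided.group(1) if decided else "",
    }


def meta_headers(fields):
    """Format a dict of metadata as HTML meta tags, one per line."""
    return "\n".join(
        '<meta name="%s" content="%s">' % (escape(name), escape(value))
        for name, value in fields.items()
    )
```

Leaving a missing value as an empty string, rather than guessing, is deliberate: the gap stays visible in the metadata file for a person to fill in.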
As for the table material, the parsed table of contents will, of course, contain a lot of data that will not require the creation of new documents. That is because there are usually a number of cases mentioned per page in these sections. We still want to provide access to this information, so we will create single page documents for each table page, but still assemble a metadata record for each case entry. This will allow a party name search, etc. to produce results and return the table page. These metadata records will include the notation “TABLE,” which will be useful later during proofreading.
As mentioned above, we are not doing anything with the formatting of the decision itself other than using the -layout switch when converting the pages from PDF to text in pdftotext. This leaves a very readable document, and is an acceptable result. Unfortunately, footnotes do not come out particularly well, but they are still reliably located at the bottom of each page. It is something, however, that we will be working to improve.
Step 5: Proofreading
At the beginning of this article, I talked about the “middle approach” to document processing that includes human editing as well as automated processing. Now that the opinions and accompanying metadata are assembled, it is time for the human intervention. It is tedious, but the metadata file must be reviewed for errors and omissions. I construct the metadata file as a quote-delimited text file, one record per line. The first thing to do is technically simple: have a text editor search for occurrences of "" (an empty pair of quotes). This would indicate that a piece of data was missed. In my experience, missing data is also an indicator of anything else that may have gone wrong, so a review and correction of the empty quotes results in a good dataset. Bringing each full opinion up in a browser and matching it with the table of contents is also recommended. At this point, we manually edit the metadata file to include missing data, and make any necessary edits to the decision files at the same time. These edits will not, of course, touch the text, but will include filling in the missing metadata in the files’ meta tags, and moving any wrongly appended pages to their proper place.
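The search for empty quotes can also be scripted. The Python sketch below assumes the metadata file is comma-separated with every field in double quotes; that separator, the field order in the example, and the function name are assumptions, and the records shown are invented.

```python
import csv


def records_with_gaps(lines):
    """Return (line_number, fields) for each record with an empty field.

    lines is any iterable of text lines, such as an open file.
    """
    flagged = []
    for line_number, fields in enumerate(csv.reader(lines), start=1):
        if any(field.strip() == "" for field in fields):
            flagged.append((line_number, fields))
    return flagged


# Example with two made-up records; the second lacks a docket number.
example = [
    '"Smith v. Jones","530 U.S. 103","99-1234","May 15, 2000"',
    '"Doe v. Roe","530 U.S. 120","","June 1, 2000"',
]
for line_number, fields in records_with_gaps(example):
    print(line_number, fields[0])
```

This only produces the list of records to look at; the looking still has to be done by a person.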
With the processing scripts we have now, this editing was a week-long project (not a full-time week, but during off times at the reference desk, etc.) that accomplished the proofing of ten full volumes of the U.S. Reports. It was tedious. The amount of tedious labor, however, was reduced to a level that I could stand, and which made it possible to complete the project in something like a reasonable amount of time. Future volumes that are released by the court will be processed individually, and will take an hour or two each to process.
This account was kept pretty non-technical and is therefore not very specific as to the programming details of what was actually done at each stage. The point of this article, however, was to describe the thought that goes into such a project, and to show that the most significant aspects of this are well within the expertise and experience of anyone who is familiar with the U.S. Reports, or legal decisions in general. The programming saves a lot of time by automating the routine, but, in an important sense, it is not the key to the project. For those interested, please e-mail me at firstname.lastname@example.org and I will send a copy of all the processing scripts. I am sure they can use improvement, but they do work.
Most important is the fact that this hour or two per volume results in a large set of very useful documents. Documents that we own and can make available to the public and our community. They are not fancy, but they are citeable, searchable in full-text, as well as by caption, date of decision, citation, and docket number. They are also fairly small in size and so don’t take up much disk space. So, they are easy and cheap to store. This collection will certainly not replace Westlaw or Lexis. At least not for the foreseeable future. But, for now, it creates another useful access point for those who don’t have access to those services. It also puts just that tiny bit of pressure on the big services to preserve their quality and keep their prices in line. It also constitutes an instance of actual distribution of critical legal information, the thing that, in the end, is the real, long term, justification for our existence as librarians.
1. bash$ pdftk filename burst
2. It is possible for there to be multiple table entries, or table entries and an opinion, in the same volume.