Category Archives: Document Harvesting

Writing screen scraping programs to build collections from material already on the Internet.

Harvesting Congressional Documents

A couple of years ago, we decided we had to wind down our collection of U.S. Congressional documents. Not only do we have no room for any more, we have no room for what we have. Our problem of having too much has led to our digitization program, about which I will have much to say later. Moving forward, however, we still want to maintain a collection of congressional documents, even without the shelf space. The solution is to download them from the GPO and add them to our growing collection of digitized documents.

On one level, the idea for doing this is really simple: they are available on the GPO websites, so just download them and add them to the repository. I know: why not just wait until law.gov makes everything available in bulk, or until the GPO makes bulk electronic downloads part of the new depository system? If and when that happens, we will be on board. In the meantime, we would like to have as complete a collection of legislative history materials as possible. Without more paper.

As stated, the basic idea is really simple: download the documents. Of course, dealing with thousands of documents in a programmatic manner makes it more of a challenge.

Downloading:

The technique we use for this is slightly more than basic screen-scrape harvesting. But not by much. The theory was to use a Perl script that would essentially perform a search of the FDsys collection and download the material linked from the search results page. This can be done, provided that FDsys presents material in predictable ways, which, fortunately, seems to be the case.

As it turns out, the approach with the FDsys material does not involve a search as such, but rather a structured drill-down into the menu-driven browsing options presented in the system. In practice, the programming is similar to canning a search, but is more predictable. It also requires a little more looping in the program.

For reasons that probably have more to do with the way I think than anything else, I found that the easiest way to approach programming a drill-down to documents was by using the congressional committee browse page. It’s the page that is found here: http://www.gpo.gov/fdsys/browse/committeetab.action. The page looks like this:

The nice thing about this page is that we can start drilling down without dealing with JavaScript or anything else. At least at the start. From here, we can parse out the links for the individual committees. These have very regular structures, as with the Senate Finance Committee: http://www.gpo.gov/fdsys/browse/committee.action?chamber=senate&committee=finance. Of course, we could just program an array of committees of the House and Senate to make things faster, but by relying on this page, we can rely on the GPO to keep the listings of committees up to date. In addition, should the GPO alter their URL/directory structure, one small change in a regular expression will fix the whole thing. Much less work in the long run, and very little added burden to the GPO.

What is actually going on in the Perl script at this point is that the wget utility is opened as a filehandle, with the download, when invoked, sent to STDOUT. The output is read in a while loop, which looks for the following pattern:

/http:\/\/www\.gpo\.gov\/fdsys\/browse\/committee\.action\?chamber=(\w+)&committee=(.*?)">/ , where $1 is the chamber and $2 is the committee.
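The same pattern can be tried outside the harvesting script. What follows is a minimal sketch in Python rather than the Perl the script is written in; the sample HTML is made up for illustration, and the optional "amp;" in the pattern is an added assumption to cover pages that escape the ampersand.

```python
import re

# Mirrors the Perl expression: group 1 is the chamber, group 2 the committee.
# The optional "amp;" is an assumption, in case the page writes & as &amp;.
COMMITTEE_LINK = re.compile(
    r'http://www\.gpo\.gov/fdsys/browse/committee\.action'
    r'\?chamber=(\w+)&(?:amp;)?committee=(.*?)">'
)

def find_committees(html):
    """Return (chamber, committee) pairs in page order, without duplicates."""
    seen = []
    for chamber, committee in COMMITTEE_LINK.findall(html):
        if (chamber, committee) not in seen:
            seen.append((chamber, committee))
    return seen

# Hypothetical fragment of the browse page, invented for this example.
sample = (
    '<a href="http://www.gpo.gov/fdsys/browse/committee.action'
    '?chamber=senate&committee=finance">Finance</a>\n'
    '<a href="http://www.gpo.gov/fdsys/browse/committee.action'
    '?chamber=house&amp;committee=judiciary">Judiciary</a>\n'
)

for chamber, committee in find_committees(sample):
    print(chamber, committee)
```

Collecting the pairs into a list first, rather than acting on each match as it is read, makes it easy to review what was found before any further requests go out.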

At this point, we have the information we need to take some shortcuts. The link we have grabbed will generate a page with Ajax code. That code will allow expansion of some categories, first by document type (hearings, prints, reports), and then by Congress. The thing to do is to add that information, so we can get to document links. The Perl line is this:

open(GETLIST, "wget \"http://www.gpo.gov/fdsys/browse/committeecong.action?collection=$collec&chamber=$chamb&committee=$comtee&congressplus=$congno&ycord=0\" -q -O - |"); # all in one line, of course.

Where $collec is the document type (CHRG for hearings, CRPT for reports, or CPRT for prints), $chamb is the previously grabbed House or Senate, $comtee is the previously grabbed committee string, and $congno is the congress for which you want to gather material. Depending on what and how much you want, additional nested loops would be used to cycle through document types and Congresses. The page that will be opened and parsed actually appears like this:


As the above filehandle is read through its own while loop, the PDF, Text, and More links can be identified with a regular expression. In our case, what we really want is the “More” link. It creates a little more work, but is very worth it.
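The listing request that produces this page, with its nested loops over document types and Congresses, might be generated along these lines. This is a sketch in Python, not the Perl of the actual script; the function name and the example Congress numbers are made up, while the parameter names and collection codes are the ones given above.

```python
from itertools import product
from urllib.parse import urlencode

BASE = "http://www.gpo.gov/fdsys/browse/committeecong.action"
COLLECTIONS = ("CHRG", "CRPT", "CPRT")  # hearings, reports, prints

def listing_urls(chamber, committee, congresses):
    """Yield one listing URL per (document type, Congress) combination."""
    for collection, congress in product(COLLECTIONS, congresses):
        query = urlencode({
            "collection": collection,
            "chamber": chamber,
            "committee": committee,
            "congressplus": congress,
            "ycord": 0,
        })
        yield f"{BASE}?{query}"

# Example values only: one committee, two Congresses, three document types.
urls = list(listing_urls("senate", "finance", [110, 111]))
for url in urls:
    print(url)
```

Generating the full list up front also gives a quick count of how many listing pages a run will request before anything is downloaded.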

This “More” link does not download a document, but links to one last page. This page will contain a link to a ZIP file which contains both text and PDF versions of the document, as well as PREMIS and MODS metadata files. So, we grab the link to the “More” page, download it to STDOUT like all the previous pages, and actually save the ZIP file with this:

system("wget -O $filen -w 3 -nc --random-wait \"http://www.gpo.gov/fdsys/delivery/getpackage.action?$zip\"");

where $zip is the file name of the ZIP file, and $filen is the filename that we want to save the file as.

A word on politeness to the GPO: most of those who would be inclined to actually do what I’m writing about here already know this, but in order for this sort of thing to work, it must be done in a manner that will not bring down the GPO servers. We’re not doing this to be mean, right? So, the “-w 3 -nc --random-wait” switches in that last wget call are very important. The -w 3 and --random-wait switches ensure that the program will wait an average of 3 seconds between downloads. This slows things down, but relieves the potential load that a program like this might put on the remote server. In the case of the previous pages, this is not necessary because they are rather small XHTML files, and are only read once. The ZIP files are often in the megabytes, and, if you are looping through Congresses and document types, there are thousands of them.
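The same kind of pause can also be built into the calling script itself. One caveat, offered as an understanding of wget rather than a tested fact: its wait options pause between retrievals within a single run, so a script that starts a fresh wget for each file may want its own pause as well. Here is a minimal sketch in Python (not the Perl of the actual script), with hypothetical function names, that waits between 0.5 and 1.5 times a base interval before every request after the first.

```python
import random
import time

def polite_delay(base_seconds=3.0, rng=random):
    """Pick a wait between 0.5x and 1.5x the base, so the average is the base."""
    return rng.uniform(0.5 * base_seconds, 1.5 * base_seconds)

def fetch_all(items, fetch, base_seconds=3.0, sleep=time.sleep):
    """Call fetch(item) for each item, pausing before every call after the first."""
    results = []
    for index, item in enumerate(items):
        if index > 0:
            sleep(polite_delay(base_seconds))
        results.append(fetch(item))
    return results
```

Here fetch stands for whatever actually downloads one ZIP file, for instance a function wrapping the wget call. Passing sleep in as a parameter is only there so the pause can be swapped out when trying the loop without waiting.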

By all means, get them all for your own repository. But be kind.

Next Post: What to do with all these files once you have them.

First Post: Supreme Court Documents

As a digital services librarian, I am regularly asked what it is I do. I tell them: “I produce and maintain digital repositories”. The response is: “OK, but what do you actually, you know, do?” For those few who persist, it often becomes like a Lewis Carroll poem, with me spouting on, only to get the question repeated to me again, but without me getting hit. Unless you count: “Oh, OK, whatever.” as a slap of sorts.

As someone who does abstruse things with documents, this kind of thing is to be expected. However, the fact that I get this response from librarians and programmers alike gives me pause. This is useful work, but no one wants to do it. Librarians simply do not want to gain the programming skills that would allow them to exercise their unique skills in new and valuable ways. Programmers are generally averse to getting so involved with having to understand the text they would have to work with. No one seems to like projects where a computer program must do much of the work, but where the expected error levels make human monitoring and intervention a necessary part of any system created. No one has time to parse large bodies of information manually, but since computers can’t do it perfectly, it is dismissed as impossible.

To the extent that librarians (and systems people) do get involved in digital repositories, the tendency seems to be either to settle for very expensive off-the-shelf solutions that typically do a very generic job with generating essential metadata, and which do little or no really useful text processing, or they get involved in small projects that can be completed with a great deal of manual processing and metadata gathering. Again, very expensive, hence the small projects. So, we get big but not robust, or robust but very small.

I have found, however, with every single project I have completed, that a middle approach tends to work very well. You can get big and pretty robust. In this article, I will describe one small project involving the “bound volumes” of the U.S. Reports that are available for download from the Supreme Court’s website. We have been processing these in order to add to our collection of U.S. Supreme Court cases that we originally collected from public.resource.org.

My hope here is to convey a sense of how one can work with digital materials in order to create a robust and usable digital repository with little cost. I will not talk about programming per se, but about what I look for and think about when I write scripts that process material for a digital repository. For those who can program, I hope it may give a sense of what can be accomplished by paying close attention to the material. For those who know the material but do not program, I hope it gives a sense of what can be accomplished by thinking about the material from the point of view of a programmer, and how far you can get with a little bit of skill and some imagination.

At this point, we still are not quite ready to roll these documents out for general viewing, but an example of the results can be seen here: http://lawlibrary.rutgers.edu/resource.org/US_reports/US/530/530.US.103_1.html.

The scripts that do the processing I describe are available for the asking. All I ask is that you treat them like any good open source project: submit your improvements, etc. Also, please try not to be too snarky when you comment on how much improving my code requires. I know I lack style. I know they are not optimized.

Supreme Court bound volumes are made available by the court at http://www.supremecourtus.gov/opinions/boundvolumes.html . They consist of single PDF files, one large file per volume. They are a wonderful resource because, among other things, they contain the pagination of the bound volume, which is required for official citation. The problem, however, is that the volumes are not very searchable, the individual cases have no identifying markup, and there is no internal hyperlinking in the documents.

In order to address these issues, and in order to have our own edition of the U.S. Reports, we have been downloading these volumes and processing them to provide for good usability. What follows is an account of the steps we went through to prepare these documents for use. Please note that all processing was done on computers running Linux (Fedora 11), and that all of the software is either a utility included in a standard Linux distribution or easily available freeware.

Step 1: Bursting

After a volume is downloaded, the first thing to do is to split the large PDF file into individual pages. We do this using a freeware utility called pdftk. The developers of pdftk advertise it by saying: “If PDF is electronic paper, pdftk is the stapler, staple remover, hole punch, and binding machine.” Running it against a PDF file with the “burst” command [1] will result in the creation of a series of one-page PDFs. These one-page PDFs are then converted to ASCII text with the pdftotext freeware utility. By default, pdftotext does not attempt to retain any of the original formatting. However, using the -layout switch, basic formatting (tabs, justification, etc.) will be retained. For ease of use, these processes are easily combined in a simple script.
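As an illustration of how the two utilities can be combined, here is a minimal sketch in Python; it is not the script we actually use. The file naming (pg_0001.pdf and so on) and the function names are assumptions made for the example, and running burst_and_convert requires pdftk and pdftotext to be installed.

```python
import subprocess
from pathlib import Path

def burst_command(volume_pdf, out_dir):
    """pdftk command that splits a volume into pg_0001.pdf, pg_0002.pdf, ..."""
    pattern = str(Path(out_dir) / "pg_%04d.pdf")
    return ["pdftk", str(volume_pdf), "burst", "output", pattern]

def text_command(page_pdf):
    """pdftotext command that keeps the layout and writes a .txt beside the PDF."""
    page_pdf = Path(page_pdf)
    return ["pdftotext", "-layout", str(page_pdf), str(page_pdf.with_suffix(".txt"))]

def burst_and_convert(volume_pdf, out_dir):
    """Run both steps in order; stops with an error if either utility fails."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(burst_command(volume_pdf, out_dir), check=True)
    for page_pdf in sorted(Path(out_dir).glob("pg_*.pdf")):
        subprocess.run(text_command(page_pdf), check=True)
```

Zero-padded file names keep the pages in the right order when they are sorted by name, which matters for everything that follows.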

Step 2: Get the Table of Cases.

At this point, we have individual pages of well-formatted ASCII text. The sequential numbering of these files, however, does not match the pagination of the text, and we need to identify which pages each court opinion is on. To do this, we need to identify the table of cases (located near the beginning of the volume) and parse the table to gather and match cases with page numbers. Then, we need to identify which file in the sequence of files contains “page 1”, and also identify any exceptions to the regular sequential pagination that may exist in the text. Fortunately, there are enough hints in the text, and enough regularity in the format and text, to allow a computer to do this for us. Finally, as we do all this, we will be identifying several pieces of useful metadata concerning each case. At this point, that metadata is kind of by-the-way, but we will save it for later use.

The first step is a run through the text looking for the pages that make up the Table of Cases. The most useful facts about the Table are: 1. It uses a string of ellipses across the page connecting each case entry with its page numbers; 2. The Table is at the beginning of each volume, starting and ending within the first few hundred pages of text; 3. The Table is always succeeded immediately by the first reported case; and 4. The first page of reported cases (page 1) always contains the text “CASES ADJUDGED” (all caps) at the top of the page.

Using a bit of logic, we can make some rules that a computer can follow:

  1. Start searching the files sequentially from the beginning for a page that has several strings of ellipses on it. If it is followed by several others, they are all part of the Table of Cases. Note them.
  2. Once you have identified pages that make up the Table, start looking for the phrase “CASES ADJUDGED” at the top of the page. When you see that, you have the whole Table.
  3. If you get to around file number 250 and are still getting pages that look like Table pages, quit with an error message.
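These rules can be sketched in a few lines. The version below is in Python and is a simplified illustration, not the actual script: the thresholds (five dots in a row, three such lines per page, the first five lines counting as the top of the page) are assumptions chosen for the example, and rule 3 is simplified to stop at the file limit whatever the page looks like.

```python
import re

DOT_LEADER = re.compile(r"(?:\.\s?){5,}")  # five or more dots, optionally spaced

def looks_like_table_page(text, min_leaders=3):
    """True if several lines on the page contain a run of dots."""
    hits = sum(1 for line in text.splitlines() if DOT_LEADER.search(line))
    return hits >= min_leaders

def starts_with_cases_adjudged(text, top_lines=5):
    """True if CASES ADJUDGED appears among the first few non-blank lines."""
    top = [line for line in text.splitlines() if line.strip()][:top_lines]
    return any("CASES ADJUDGED" in line for line in top)

def find_table_and_page_one(pages, limit=250):
    """Return (indexes of Table pages, index of the file holding page 1)."""
    table = []
    for index, text in enumerate(pages):
        if index >= limit:
            raise RuntimeError("no CASES ADJUDGED page within the file limit")
        if table and starts_with_cases_adjudged(text):
            return table, index
        if looks_like_table_page(text):
            table.append(index)
    raise RuntimeError("ran out of pages before finding CASES ADJUDGED")
```

Here pages is simply the list of page texts in file order, so the indexes returned correspond to positions in the sequence of burst files.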

At this point, the program that is doing the above will generate a list of the files that contain the Table of Cases pages. Just as important, it has also identified the file that contains the all-important Page 1 (i.e. the first page with Arabic numerals, these being the pages that will be cited). Before parsing the Table, however, there is a little more that needs to be done. This is because in every volume of the U.S. Reports, there will be one or more breaks in the page sequence to accommodate the insertion of last-minute material. These breaks occur between the opinions and orders, and between orders and any other matter that may be inserted after the orders. These breaks need to be located and the exceptions noted.

Fortunately, the places where these breaks occur are clearly identified with very regular text and formatting, and labeled as a “Reporter’s Note”. Here is an example:

Reporter’s Note

The next page is purposely numbered 1001. The numbers between 875 and 1001 were intentionally omitted, in order to make it possible to publish the orders with permanent page numbers, thus making the official citations available upon publication of the preliminary prints of the United States Reports.

For each occurrence of this, there will be nothing else on the page. Given this, we make our parsing program continue searching the rest of the text pages looking for the phrase “Reporter’s Note” and, more significantly, “The next page is purposely numbered xxx.” The program will make a note of the file in which these notes occur, and the new page number specified in the note. Each time this occurs, we must abandon the previous page numbering sequence, and restart with filenumber + 1 = xxx. Also, keep in mind that any case listed after these breaks will be an order, not a full-blown opinion. As such, there will usually be several per page, and they are not as significant as full-blown decisions on the merits. Therefore, we will note in the metadata any decision listed in the tables that is assigned to these pages.
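The bookkeeping for these breaks can be sketched as follows. This is an illustrative Python version, not the actual script, and the function names are made up. It records an anchor wherever the numbering starts or restarts, then uses the anchors to turn a printed page number into a file index. Page numbers that fall inside an omitted range are not handled.

```python
import re

NEXT_PAGE = re.compile(
    r"next\s+page\s+is\s+purposely\s+numbered\s+(\d+)", re.IGNORECASE
)

def page_breaks(pages, page_one_index):
    """Return [(file_index, printed_page)] anchors where numbering (re)starts."""
    anchors = [(page_one_index, 1)]
    for index in range(page_one_index, len(pages)):
        text = pages[index]
        match = NEXT_PAGE.search(text)
        if match and "Reporter" in text:
            # The file after the note carries the new page number.
            anchors.append((index + 1, int(match.group(1))))
    return anchors

def file_for_page(anchors, printed_page):
    """Map a printed page number to a file index via the last applicable anchor."""
    start_index, start_page = anchors[0]
    for index, page in anchors:
        if page <= printed_page:
            start_index, start_page = index, page
    return start_index + (printed_page - start_page)
```

With page 1 in file 2 and a note in file 6 announcing page 1001, the anchors are (2, 1) and (7, 1001), so printed page 1003 is found in file 9.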

Step 3: Parse the Table of Cases

Now for parsing the table. Compared with what has already been done, parsing the Table is pretty straightforward. We make the computer go through each page line by line, making a note of lines that have text, then a series of ellipses, followed by numbers. Grab the text, which will be the caption of the case, and the numbers after the ellipses, which are the text’s page numbers [2], and, using the information gathered from counting where page 1 starts, and the later changes, calculate which file contains the first page of the listed case. We write this data to a file so it can be reviewed for problems before compiling opinion files. The file includes the caption, beginning page number, and beginning file number for each table of contents entry.
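A line of the Table can be picked apart with a single regular expression. The following is a sketch in Python under assumed formatting (a caption, at least three dots, then one or more comma-separated page numbers); the real layout may need a more forgiving pattern, and the case names used to try it are invented.

```python
import re

# Caption, then a run of dots, then one or more page numbers separated by commas.
ENTRY = re.compile(
    r"^\s*(?P<caption>.+?)\s*(?:\.\s?){3,}\s*(?P<pages>\d+(?:\s*,\s*\d+)*)\s*$"
)

def parse_table_line(line):
    """Return (caption, [page numbers]) for a Table entry, or None otherwise."""
    match = ENTRY.match(line)
    if not match:
        return None
    pages = [int(p) for p in re.split(r"\s*,\s*", match.group("pages"))]
    return match.group("caption"), pages

# Hypothetical entries, not taken from a real volume.
for line in ["Doe v. Roe . . . . . . . . 466",
             "Poe v. Moe ........ 1001, 1042",
             "TABLE OF CASES REPORTED"]:
    print(parse_table_line(line))
```

One known limitation of this pattern: a caption that ends in an abbreviation, such as “Co.”, loses its final period to the run of dots, which is the sort of thing the later review of the data file is meant to catch.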

Step 4: Creating files of individual decisions

Creating a single file for each opinion is now a matter of parsing the above data file and appending files. This is done via a script. We are naming the files based on the U.S. Reports citation, derived from the page and volume data. While we are doing the appending, however, we have an opportunity to do something really useful: we enclose the content of each page in a set of <div> tags in the form: <div class="page" value="X">. This is accompanied by a visible label: <p class="pnum">Page X</p>. With these inclusions, a simple stylesheet can make the decision display with clearly visible pagination that accurately corresponds to the U.S. Reports. It is now a completely citable document. Finally, the last thing we do with each page is to wrap the text in <pre> tags to preserve the basic formatting that was created when the page was converted from PDF.
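The wrapping itself is simple. Here is a minimal Python sketch of it, not the actual script; the function names are made up, and escaping the characters &, < and > in the page text is an added precaution that is not part of the description above.

```python
from html import escape

def wrap_page(text, page_number):
    """Wrap one page of plain text in the div, page label, and pre structure."""
    return (
        f'<div class="page" value="{page_number}">\n'
        f'<p class="pnum">Page {page_number}</p>\n'
        f"<pre>{escape(text, quote=False)}</pre>\n"
        "</div>\n"
    )

def assemble_opinion(page_texts, first_page):
    """Concatenate consecutive pages of one opinion, numbered from first_page."""
    return "".join(
        wrap_page(text, first_page + offset)
        for offset, text in enumerate(page_texts)
    )
```

Because every page carries its own number in both an attribute and a visible label, a stylesheet or a later script can find any cited page without re-deriving the pagination.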

At the same time that the documents are being assembled, we also find and assemble some metadata items that will create some specialized search fields and make the document self-identifying. We already have the caption and citation, and the URL and md5sum of the original PDF file. In addition, however, we can parse the first page of each decision, and lift out the docket number and date of decision. These are located in predictable locations near the beginning of those initial pages, surrounded by the kind of stylized formatting and language that make searching for such things pretty reliable. The processing script formats these data items as well-formed HTML meta headers and places them in the <head> section of the file that is being created. Finally, this metadata is also written to a file that will be proofread for accuracy, then uploaded to a MySQL table to be used in our user interface.
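Formatting the collected items as meta headers might look like the sketch below. It is written in Python for illustration only; the field names and values are hypothetical and are not the names used in our files.

```python
from html import escape

def meta_headers(record):
    """Render a metadata record as HTML meta elements, one per line."""
    return "\n".join(
        f'<meta name="{escape(name)}" content="{escape(str(value))}">'
        for name, value in record.items()
    )

# Hypothetical record; field names and values are invented for the example.
record = {
    "caption": "Doe v. Roe",
    "citation": "530 U.S. 1",
    "docket": "99-0001",
    "decided": "June 1, 2000",
}
print(meta_headers(record))
```

Escaping the values matters here because captions can contain ampersands and quotation marks, either of which would otherwise break the attribute.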

As for the table material, the parsed table of contents will, of course, contain a lot of entries that will not require the creation of new documents. That is because there are usually a number of cases mentioned per page in these sections. We still want to provide access to this information, so we will create single-page documents for each table page, but still assemble a metadata record for each case entry. This will allow a party name search, etc. to produce results and return the table page. These metadata records will include the notation “TABLE,” which will be useful later during proofreading.

As mentioned above, we are not doing anything with the formatting of the decision itself other than using the -layout switch when converting the pages from PDF to text in pdftotext. This leaves a very readable document, and is an acceptable result. Unfortunately, footnotes do not come out particularly well, but they are still reliably located at the bottom of each page. It is something, however, that we will be working to improve.

Step 5: Proofreading

At the beginning of this article, I talked about the “middle approach” to document processing that includes human editing as well as automated processing. Now that the opinions and accompanying metadata are assembled, it is time for the human intervention. It is tedious, but the metadata file must be reviewed for errors and omissions. I construct the metadata file as a quote-delimited text file, one record per line. The first thing to do is technically simple: have a text editor search for occurrences of “”. This would indicate that a piece of data was missed. In my experience, missing data is also an indicator of anything else that may have gone wrong, so a review and correction of the empty quotes results in a good dataset. Bringing each full opinion up in a browser and matching it with the table of contents is also recommended. At this point, we manually edit the metadata file to include missing data, and make any necessary edits to the decision files at the same time. These edits will not, of course, touch the text, but will include filling in the missing metadata in the files’ meta tags, and moving any wrongly appended pages to their proper place.
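The search for empty quotes can also be scripted, which helps when the file is long. The sketch below is in Python and assumes the fields are separated by commas, which the description above does not specify; the sample records are invented.

```python
import csv
import io

def records_with_gaps(handle):
    """Yield (line_number, record) for every record containing an empty field."""
    for line_number, record in enumerate(csv.reader(handle), start=1):
        if any(field.strip() == "" for field in record):
            yield line_number, record

# Hypothetical metadata file contents: caption, citation, docket, date.
sample = io.StringIO(
    '"Doe v. Roe","530 U.S. 1","99-0001","June 1, 2000"\n'
    '"Poe v. Moe","530 U.S. 20","","June 5, 2000"\n'
)

gaps = list(records_with_gaps(sample))
for line_number, record in gaps:
    print(line_number, record)
```

This only produces the list of records to look at. The review and correction remain a human job.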

With the processing scripts we have now, this editing was a week-long project (not a full-time week, but during off times at the reference desk, etc.) that accomplished the proofing of ten full volumes of the U.S. Reports. It was tedious. The amount of tedious labor, however, was reduced to a level that I could stand, and which made it possible to complete the project in something like a reasonable amount of time. Future volumes that are released by the court will be processed individually, and will take an hour or two each to process.

Conclusion

This article was kept pretty non-technical and therefore not very specific as to the programming details of what was actually done at each stage. The point of this article, however, was to describe the thought that goes into such a project, and to show that the most significant aspects of this are well within the expertise and experience of anyone who is familiar with the U.S. Reports, or legal decisions in general. The programming saves a lot of time by automating the routine, but, in an important sense, it is not the key to the project. For those interested, please e-mail me at jjoerg@camden.rutgers.edu and I will send a copy of all the processing scripts. I am sure they can use improvement, but they do work.

Most important is the fact that this hour or two per volume results in a large set of very useful documents. Documents that we own and can make available to the public and our community. They are not fancy, but they are citable, searchable in full-text, as well as by caption, date of decision, citation, and docket number. They are also fairly small in size and so don’t take up much disk space. So, they are easy and cheap to store. They will certainly not replace Westlaw or Lexis. At least not for the foreseeable future. But, for now, they create another useful access point for those who don’t have access to those services. They also put just that tiny bit of pressure on the big services to preserve their quality and keep their prices in line. This also constitutes an instance of actual distribution of critical legal information, the thing that, in the end, is the real, long-term justification for our existence as librarians.

1. bash$ pdftk filename burst

2. It is possible for there to be multiple table entries, or table entries and an opinion, in the same volume.
