Monthly Archives: January 2010

Harvesting Congressional Documents

A couple of years ago, we decided we had to wind down our collection of U.S. Congressional documents. Not only do we have no room for any more, we have no room for what we have. Our problem of having too much has led to our digitization program, about which I will have much to say later. Moving forward, however, we still want to maintain a collection of congressional documents, even without the shelf space. The solution is to download them from the GPO and add them to our growing collection of digitized documents.

On one level, the idea for doing this is really simple: the documents are available on the GPO websites, so just download them and add them to the repository. I know, why not just wait until everything is made available in bulk, or until the GPO makes bulk electronic downloads part of the new depository system? If and when that happens, we will be on board. In the meantime, we would like to have as complete a collection of legislative history materials as possible. Without more paper.

As stated, the basic idea is really simple: download the documents. Of course, dealing with thousands of documents in a programmatic manner makes it more of a challenge.


The technique we use for this is slightly more than basic screen-scrape harvesting. But not by much. The theory was to use a Perl script which would essentially perform a search of the FDsys collection and download the material linked from the search results page. This can be done, provided that FDsys presents material in predictable ways, which, fortunately, seems to be the case.

As it turns out, the approach with the FDsys material does not involve a search as such, but rather a structured drill-down into the menued browsing options presented in the system. In practice, the programming is similar to canning a search, but is more predictable. It also requires a little more looping in the program.

For reasons that probably have more to do with the way I think than anything else, I found that the easiest way to approach programming a drill-down to documents was by using the congressional committee browse page.

The nice thing about this page is that we can start drilling down without dealing with JavaScript or anything else. At least at the start. From here, we can parse out the links for the individual committees. These have very regular structures, as with the link for the Senate Finance Committee. Of course, we could just hard-code an array of the House and Senate committees to make things faster, but by relying on this page, we can rely on the GPO to keep the listings of committees up to date. In addition, should the GPO alter their URL/directory structure, one small change in a regular expression will fix the whole thing. Much less work in the long run, and very little added burden to the GPO.

What is actually going on in the Perl script at this point is that the wget utility is opened as a filehandle, with the download, when invoked, being sent to STDOUT. The download is then read in a while loop, which looks for the following pattern:

/http:\/\/www\.gpo\.gov\/fdsys\/browse\/committee\.action\?chamber=(\w+)\&committee=(.*?)\">/, where $1 is the chamber and $2 is the committee.
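For readers who would rather see the parsing step outside Perl, here is a minimal sketch of the same match in Python. The function name and the sample line are made up for illustration, and the real page markup may differ from the sample.

```python
import re

# Python rendering of the pattern above: group 1 is the chamber,
# group 2 is the committee string, up to the closing quote of the link.
COMMITTEE_LINK = re.compile(
    r'http://www\.gpo\.gov/fdsys/browse/committee\.action'
    r'\?chamber=(\w+)&committee=(.*?)">'
)

def iter_committees(lines):
    """Yield (chamber, committee) pairs found in an iterable of HTML lines."""
    for line in lines:
        for match in COMMITTEE_LINK.finditer(line):
            yield match.group(1), match.group(2)

# Made-up sample line for illustration only.
sample = [
    '<a href="http://www.gpo.gov/fdsys/browse/committee.action'
    '?chamber=senate&committee=finance">Finance</a>',
]
print(list(iter_committees(sample)))
```

In a real run, the lines could come from the standard output of a wget process, for instance via subprocess.Popen(["wget", "-q", "-O", "-", url], stdout=subprocess.PIPE, text=True).stdout, which mirrors opening wget as a filehandle.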

At this point, we have the information we need to take some shortcuts. The link we have grabbed will generate a page with Ajax code. That code will allow expansion of some categories, first by document type (hearings, prints, reports), and then by Congress. The thing to do is to add that information to the request, so we can get straight to the document links. The Perl line is this:

open(GETLIST, "wget \" $collec&chamber=$chamb&committee=$comtee&congressplus=$congno&ycord=0\" -q -O -|");

Here, $collec is the document type (CHRG for hearings, CRPT for reports, or CPRT for prints), $chamb is the previously grabbed chamber (House or Senate), $comtee is the previously grabbed committee string, and $congno is the Congress for which you want to gather material. Depending on what and how much you want, additional nested loops would be used to cycle through document types and Congresses. The page that is then opened and parsed lists each document with PDF, Text, and More links.
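The nested loops over committees, document types, and Congresses might be sketched as follows, again in Python rather than Perl. This is only an illustration: the function names are invented, and PREFIX is a placeholder, since the start of the URL does not appear in the Perl line above and is not guessed at here.

```python
from itertools import product

# Document type codes named in the text above.
DOC_TYPES = ("CHRG", "CRPT", "CPRT")  # hearings, reports, prints

def query_suffix(collec, chamb, comtee, congno):
    """Build the part of the query string that is visible in the Perl line."""
    return (f"{collec}&chamber={chamb}&committee={comtee}"
            f"&congressplus={congno}&ycord=0")

def iter_queries(prefix, committees, doc_types, congresses):
    """Yield one URL per (committee, document type, Congress) combination."""
    for (chamb, comtee), collec, congno in product(
            committees, doc_types, congresses):
        yield prefix + query_suffix(collec, chamb, comtee, congno)

# PREFIX is a placeholder, not a real address.
PREFIX = "PREFIX"
committees = [("senate", "finance")]
for url in iter_queries(PREFIX, committees, DOC_TYPES, [110, 111]):
    print(url)
```

Using itertools.product keeps the three loops flat; explicit nested for loops would do the same job and may be easier to extend if some combinations need to be skipped.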

As the above filehandle is read through its own while loop, the PDF, Text, and More links can be identified with a regular expression. In our case, what we really want is the “More” link. It creates a little more work, but is well worth it.

This “More” link does not download a document, but links to one last page. This page will contain a link to a ZIP file which contains both text and PDF versions of the document, as well as PREMIS and MODS metadata files. So, we grab the link to the “More” page, download it to STDOUT like all the previous pages, and actually save the ZIP file with this:

system("wget -O $filen -w 3 -nc --random-wait


where $zip is the file name of the ZIP file, and $filen is the filename that we want to save the file as.

A word on politeness to the GPO: most of those who would be inclined to actually do what I’m writing about here already know this, but for this sort of thing to work, it must be done in a manner that will not bring down the GPO servers. We’re not doing this to be mean, right? So, the “-w 3 -nc --random-wait” switches in that last wget call are very important. The -w 3 and --random-wait ensure that the program will wait an average of 3 seconds before downloading. This slows things down, but relieves the potential load that a program like this might put on the remote server. In the case of the previous pages, this is not necessary because they are rather small XHTML files, and are only read once. The ZIP files are often in the megabytes, and, if you are looping through Congresses and document types, there are thousands of them.
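The same courtesy can also be built into the harvesting script itself. The wget manual describes --random-wait as varying the pause between 0.5 and 1.5 times the -w value, and those options appear to govern the pause between retrievals within a single wget run, so a script that starts wget once per file may want to pause on its own as well. A minimal sketch in Python, with invented function names and a stand-in for the actual download:

```python
import random
import time

def polite_delay(base_seconds, rng=random):
    """Pick a pause between 0.5x and 1.5x the base, so it averages the base."""
    return rng.uniform(0.5 * base_seconds, 1.5 * base_seconds)

def download_all(urls, fetch, base_seconds=3, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing before every one after the first."""
    for index, url in enumerate(urls):
        if index > 0:
            sleep(polite_delay(base_seconds))
        fetch(url)

# Demonstration with stand-ins: nothing is downloaded and nothing sleeps.
fetched = []
pauses = []
download_all(["a.zip", "b.zip", "c.zip"],
             fetch=fetched.append, sleep=pauses.append)
print(fetched)
print(len(pauses))
```

In real use, fetch would be whatever saves one ZIP file, and the default time.sleep would supply the actual pause.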

By all means, get them all for your own repository. But be kind.

Next Post: What to do with all these files once you have them.


Embedded Metadata, Part Deux: Authentication

One of the comments about my last post, by Ms. Amaral, concerned the issue of authenticity of documents which have been altered by the addition of metadata in the way I described. I think it is an important matter, and warrants some discussion.

First off, what is actually meant by “authenticity” ought to be addressed. In a strict sense, anything that is copied from another source is, of course, no longer the authentic item. But that is not how the term is used in digital libraries. An authentic item is something that, in the end, seems to comply with an idea of accuracy in reproducing an original. That idea of accuracy, moreover, is dependent on the information intended to be conveyed.

That may sound marvellously vague, especially in the context of the words “accurate” and “authentic”, but I think it fits our actual use of the word. Take, say, a court decision, where the essential information consists of the text and whatever formatting is needed to preserve the context of the text (paragraph structure, block quotes, footnotes, etc.). In this case, a conversion from a word processor format to HTML or XML could easily retain all the necessary characteristics for any reasonable person to allow that it is “authentic”. In fact, in the case of Lexis, Westlaw, and most commercial news databases, this is standard.

In contrast to the above example, a scan of an ancient text might be considered differently. In a case where the original appearance of the object is what is of interest, an “authentic” digital copy would need to retain much more data, and could only be achieved with an image of reasonably high resolution and color depth, such that where the object of debate is whether the “e” in a unique original is really an “o” with a mold stain, you can zoom in and closely compare that “e” with other “e’s”. And so on. Where a scan is made for such a purpose but the resolution is insufficient, it is not sufficiently “authentic” for that purpose.

In non-digital library contexts, this notion of authenticity is even more apparent. The example of translations makes this clear. What is an “authentic” translation? In a strict sense, the question is nonsense. In normal parlance, however, we use the term “authentic translation” in a utilitarian sense. There can be many translations of a piece of literature, all “authentic”, and all different. One for the general reader, taking more liberties with literal meaning to make the work as readable in the new language as in the old. One for the scholar, sacrificing the original’s readability for a more literal meaning. One for the artist, attempting to preserve the literary depth of the original with some sacrifice of both readability and literal accuracy. All could be different, but all competent, and each an “authentic” rendition of the original.

Of course, with documents in a digital library, the editorial decisions of the language translator don’t happen. We do, however, make other types of decisions, like converting between text and image formats, etc. Each decision, including the decision to do nothing, has its advantages and disadvantages. But in the end, the practical accuracy of the information is what counts. I suggest that a document is still perfectly authentic even if converted from Microsoft Word to HTML, even if the footnotes look a little different. As long as you don’t change the text itself.

Having said all that, there is still a substantive issue concerning authenticity, but it is not the authenticity itself we are concerned with, but proof of authenticity. This is an issue because the popular methods of establishing authenticity of digital objects involve digital signatures, which do not verify the accuracy of the human-interpretable information, but only whether the digital content of the object is identical to the original. Now, this does indeed verify authenticity, but only in a very strict sense which does not comport with practical usage. It is not practical because it does not allow for the inclusion of essential preservation metadata, even metadata which alters absolutely nothing in the appearance of the object. It excludes the possibility of converting from an original format which may be highly unsuitable for preservation to another format which would both preserve the information and be suitable for preserving the object. Finally, it excludes the possibility of making useful, non-editorial enhancements, such as hyperlinking of citations.
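The strictness is easy to demonstrate. The sketch below, in Python, uses a plain SHA-256 digest as a stand-in for the digest a digital signature is computed over; the document text and the title attribute are made up for illustration.

```python
import hashlib

def digest(data: bytes) -> str:
    """Return the SHA-256 digest of a byte string as hexadecimal text."""
    return hashlib.sha256(data).hexdigest()

# The same visible text, with and without an added metadata attribute.
visible_text = b"The judgment of the lower court is affirmed."
original = b"<doc>" + visible_text + b"</doc>"
with_metadata = b'<doc title="Example v. Example">' + visible_text + b"</doc>"

same_digest = digest(original) == digest(with_metadata)
same_text = visible_text in original and visible_text in with_metadata
print(same_digest)  # False: the bytes differ, so the digests differ
print(same_text)    # True: the text a reader sees is unchanged
```

A check built on that digest would reject the second file, even though nothing a reader sees has changed.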

There really ought to be a better way. Or at least a better way to use digital signatures.

Along the lines of a better way to use digital signatures, I have suggested that a good approach to authentication would involve preserving original digital signatures as part of maintaining an object’s provenance. Any necessary change made to the digital object ought naturally to be documented as part of the document’s metadata. Part of that documentation can and ought to include the digital signatures of previous versions. The result is a non-original object, but with the possibility of having rich metadata records, value-added features and durable formats, along with a verifiable history documenting exactly what has changed. This would have the effect of re-authenticating at the point of any acceptable alteration, as well as providing a record for investigating the source of any discrepancies that may ever be discovered.
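One way such a provenance record might look is sketched below in Python. It is an illustration only, not a description of any existing system: the class and function names are invented, and a SHA-256 digest of each earlier version stands in for that version's digital signature.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class ProvenanceEvent:
    description: str       # what was changed, in plain words
    previous_sha256: str   # digest of the version before the change

@dataclass
class DigitalObject:
    content: bytes
    history: list = field(default_factory=list)

    def sha256(self) -> str:
        return hashlib.sha256(self.content).hexdigest()

    def alter(self, new_content: bytes, description: str) -> "DigitalObject":
        """Return a new version that records the digest of this one."""
        event = ProvenanceEvent(description, self.sha256())
        return DigitalObject(new_content, self.history + [event])

def matches_history(candidate: bytes, obj: DigitalObject, step: int) -> bool:
    """Check whether candidate bytes are the version that preceded a step."""
    recorded = obj.history[step].previous_sha256
    return hashlib.sha256(candidate).hexdigest() == recorded

# Made-up example: an original, then added metadata, then a format change.
original_bytes = b"original file as received"
v1 = DigitalObject(original_bytes)
v2 = v1.alter(b"original file plus metadata", "added preservation metadata")
v3 = v2.alter(b"converted file plus metadata", "converted to a durable format")
print(len(v3.history))
print(matches_history(original_bytes, v3, 0))
```

Anyone holding a copy of the original can confirm it against the first recorded digest, and each later change has its own entry to consult if a discrepancy turns up.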

I could go on at length about this here, but I have already done that in an entry at the Cornell Legal Information Institute’s VoxPop blog. (As of 1/14/2010, this article was off-line at Cornell. I have been assured, however, that it will be back up in the very near future.)