I've written an extension to the Internet Archive's open-source "Heritrix"
crawler that writes crawl output to HDFS in SequenceFile format. The key
is the URL and the value is the HTTP response with some additional
metadata. It's quite simple to use: just drop a few jar files into
the Heritrix lib/ directory and you're good to go. The download page is
here: http://www.zvents.com/labs/hdfs_writer_processor . For
those of you who are interested, give it a whirl and feel free to send me