Nutch is resilient to automated testing

Rick Moynihan
Hi all,

A colleague I have been working with has developed a plugin to index
content with Nutch.  Although it does the job admirably, the
complexity and design of Nutch have proven resistant to writing
automated tests for this component.

I'm desperately trying to write some JUnit unit/integration tests for
this component; however, Nutch doesn't make this simple, and I fear
this, amongst other things, is a barrier to Nutch adoption.

What I want to do is:

- Set up a Jetty server within the test with the content I want to index
(easy enough with CrawlDBTestUtil).
- Configure a crawl (i.e. fetch, index, merge, dedup etc...) and
override the configuration with my plugin and configuration.
- Store the index (preferably in memory, but on disk is ok).
- Assert that particular searches return particular items etc...

At first I thought this would be a simple matter of using
CrawlDBTestUtil to establish the server side, then using
org.apache.nutch.crawl.Crawl to perform all the relevant steps resulting
in an index of the content, which I can then run assertions on via

Ideally I'd like to create just one Configuration object, override the
settings as I wish, and then pass this object into both Crawl and
NutchBean.

Sadly, however, org.apache.nutch.crawl.Crawl isn't really a reusable
class: it has only a static main method, which performs all the
operations in batch.  This design makes the class hard to reuse within
the context of my test.  This leaves me with the following options:

- Call the main method and pass it an ugly array of Strings to do what I
require.  This is also ugly because of the assumptions underlying the
design of this component (configuration files on the classpath etc...).
It also allows little or no reuse of configuration with other parts of
the code (e.g. NutchBean).

- Copy/Paste/Modify Crawl into my test.  The code in Crawl recently
changed to account for Hadoop 0.17, so I don't really want to do this
only to find that the API changes again.  Plus I believe that tests
should be simple to read.  Explicitly performing 30 steps in order to
test a component isn't a good idea, as it hides the forest for the
trees.
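
To illustrate the first option, here is a minimal sketch of a helper that
at least gives the String array some names.  CrawlArgs is a hypothetical
class of my own, not part of Nutch, and the flag names (-dir, -depth,
-topN) are assumed from Crawl's usage message, so check them against your
Nutch version.  It only builds the array; it does nothing about the
configuration-on-the-classpath problem.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical test helper (not part of Nutch): builds the String[]
 * that would be handed to org.apache.nutch.crawl.Crawl.main(...).
 */
final class CrawlArgs {

    private CrawlArgs() {
        // static helper only
    }

    /**
     * @param urlDir   directory holding the seed URL list
     * @param crawlDir directory the crawl output is written to
     * @param depth    number of fetch rounds
     * @param topN     maximum number of pages fetched per round
     */
    static String[] build(String urlDir, String crawlDir, int depth, int topN) {
        List<String> args = new ArrayList<String>();
        args.add(urlDir);                    // positional argument
        args.add("-dir");                    // assumed flag name
        args.add(crawlDir);
        args.add("-depth");                  // assumed flag name
        args.add(Integer.toString(depth));
        args.add("-topN");                   // assumed flag name
        args.add(Integer.toString(topN));
        return args.toArray(new String[0]);
    }
}
```

A test would then call something like
Crawl.main(CrawlArgs.build("urls", "crawl-test", 2, 10)), which is more
readable but still leaves the configuration out of the test's hands.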

CrawlDBTestUtil is a step in the right direction, but more work is
needed.  Is it possible to get this marked as a bug/feature-request and
fixed in time for 1.0?

Thanks again for your help.