I have been trying to integrate NutchRest with my web application, and it seems to be working. However, I realize that a crawl invoked from the application is going to take a long time, since crawling is really a candidate for batch processing.
I have been thinking of having:
1) N NutchRest servers.
2) A queuing system that receives a message whenever Nutch completes a job.
3) The queuing system should be smart enough to dispatch the subsequent job to one of the NutchRest servers. Maybe we could have a pluggable algorithm for choosing the server, with round robin as the default.
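To make the pluggable algorithm in point 3 a bit more concrete, here is a rough Python sketch of what I have in mind. Nothing in it is Nutch code: the `ServerSelector` interface, the `RoundRobinSelector` class, and the server URLs are all names I made up for illustration.

```python
from abc import ABC, abstractmethod
from itertools import cycle


class ServerSelector(ABC):
    """Hypothetical plug-in point: decides which NutchRest server gets the next job."""

    @abstractmethod
    def next_server(self) -> str:
        ...


class RoundRobinSelector(ServerSelector):
    """Default strategy: hand out the servers in order, wrapping around at the end."""

    def __init__(self, servers):
        servers = list(servers)
        if not servers:
            raise ValueError("at least one server is required")
        self._cycle = cycle(servers)

    def next_server(self) -> str:
        return next(self._cycle)


# Placeholder URLs, not real endpoints.
selector = RoundRobinSelector([
    "http://nutch-1:8081",
    "http://nutch-2:8081",
    "http://nutch-3:8081",
])
picks = [selector.next_server() for _ in range(4)]
```

Another strategy (least loaded, sticky per crawl, and so on) would just be another subclass of `ServerSelector`, so the queuing system would not need to know which one is in use.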
This way we could scale things. However, it would require code changes in Nutch: each job, when it completes, would need to send an event to the queuing system and the NutchRest server. The queuing system would then need to manage the workflow.
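As a rough illustration of the workflow part, the sketch below uses an in-memory Python queue in place of a real queuing system. The event shape (`crawl_id`, `job`), the `run_workflow` and `next_step` helpers, and the `submit` callback are all hypothetical; the step order is my assumption of the usual crawl cycle, and no actual Nutch REST call is made.

```python
import queue

# Assumed order of jobs in one crawl cycle.
CRAWL_STEPS = ["INJECT", "GENERATE", "FETCH", "PARSE", "UPDATEDB", "INDEX"]


def next_step(finished):
    """Return the job that follows `finished`, or None after the last one."""
    i = CRAWL_STEPS.index(finished)
    return CRAWL_STEPS[i + 1] if i + 1 < len(CRAWL_STEPS) else None


def run_workflow(events, servers, submit):
    """Drain completion events and dispatch each follow-up job round robin."""
    turn = 0
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            return  # nothing left to process
        step = next_step(event["job"])
        if step is None:
            continue  # crawl cycle finished, nothing to dispatch
        server = servers[turn % len(servers)]
        turn += 1
        submit(server, event["crawl_id"], step)


# Usage: a fake submit that records the dispatch and pretends the job
# completed immediately by posting its completion event back to the queue.
events = queue.Queue()
dispatched = []


def fake_submit(server, crawl_id, step):
    dispatched.append((server, step))
    events.put({"crawl_id": crawl_id, "job": step})


events.put({"crawl_id": "crawl-1", "job": "INJECT"})
run_workflow(events, ["nutch-1", "nutch-2"], fake_submit)
```

In a real setup `submit` would be the call to the chosen NutchRest server, and the completion event would come from the modified Nutch job rather than from the submit function itself.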
I am hoping someone has thought along these lines and may already have implemented something similar. I would be interested to hear the opinions of the folks here.