Nutch and workflow for scaling.

vickyk
Hello Guys,

I have been trying to integrate NutchRest with my web application, and it seems to be working. However, I realize that a crawl invoked through the application is going to take a long time, since crawling is really a candidate for batch processing.

I have been thinking of having:

1) N NutchRest servers.
2) A queuing system that receives a message whenever Nutch completes a job.
3) A queuing system smart enough to send the subsequent job to one of the NutchRest servers, perhaps through a pluggable algorithm, with round robin as the default.
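
To make point 3 concrete, here is a minimal sketch of a pluggable server-selection strategy with round robin as the default. The names `Dispatcher`, `round_robin`, and the server URLs are hypothetical and only for illustration; nothing here is part of Nutch or NutchRest.

```python
import itertools
from typing import Callable, Iterator, Sequence

# A selector takes the list of server URLs and returns an endless iterator of picks.
# Any function with this shape can be plugged in (hypothetical interface).
Selector = Callable[[Sequence[str]], Iterator[str]]


def round_robin(servers: Sequence[str]) -> Iterator[str]:
    """Default strategy: cycle through the servers in order, forever."""
    return itertools.cycle(servers)


class Dispatcher:
    """Chooses which NutchRest server receives the next job."""

    def __init__(self, servers: Sequence[str], selector: Selector = round_robin):
        if not servers:
            raise ValueError("at least one server is required")
        self._picks = selector(list(servers))

    def next_server(self) -> str:
        return next(self._picks)


# Hypothetical server addresses, for illustration only.
dispatcher = Dispatcher(["http://nutch-1:8081", "http://nutch-2:8081"])
first = dispatcher.next_server()
second = dispatcher.next_server()
third = dispatcher.next_server()  # wraps around to the first server again
```

A different algorithm (for example, least loaded) could be swapped in by passing another function as `selector`, without touching the dispatcher itself.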

This way we could scale things. However, it would require code changes in Nutch: each job, when completed, needs to send an event to the queuing system and the NutchRest server. The queuing system then needs to manage the workflow.
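
As a rough sketch of the workflow side, the queuing system could consume "job completed" events and submit the follow-up job for each one. The stage order below is an assumption based on the usual Nutch crawl cycle, and the event format, `next_stage`, `run_workflow`, and the `submit` callback are all hypothetical; a real version would call a NutchRest server instead of a plain callback.

```python
import queue
from typing import Callable, List, Optional

# Assumed stage order, loosely following the usual Nutch crawl cycle.
STAGES = ["inject", "generate", "fetch", "parse", "updatedb"]


def next_stage(finished: str) -> Optional[str]:
    """Return the stage that follows `finished`, or None after the last one."""
    i = STAGES.index(finished)
    return STAGES[i + 1] if i + 1 < len(STAGES) else None


def run_workflow(events: "queue.Queue[dict]",
                 submit: Callable[[dict], None]) -> List[dict]:
    """Drain the completion events and submit one follow-up job per event.

    Each event is a dict such as {"crawl_id": "c1", "stage": "fetch"}
    (a hypothetical format). Returns the list of jobs that were submitted.
    """
    submitted = []
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            break  # no more completion events to process
        stage = next_stage(event["stage"])
        if stage is not None:
            job = {"crawl_id": event["crawl_id"], "stage": stage}
            submit(job)  # in practice: send the job to a NutchRest server
            submitted.append(job)
    return submitted
```

In a real deployment the in-process `queue.Queue` would be replaced by a message broker, and the loop would block waiting for events rather than exit when the queue is empty.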

I am hoping someone has thought along these lines and may have implemented it already. I would be interested to hear the opinion of the folks here.

Hoping to hear more about it.

Thanks,
Vicky
Re: Nutch and workflow for scaling.

vickyk
It seems this project is what I have been proposing, though I still have to check the code base:
https://github.com/USCDataScience/sparkler

Thanks,
Vicky