Best and most economical way to set up a Hadoop cluster for distributed crawling
I have been running Nutch in local mode, and so far I have a good
understanding of how it all works.
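For context, my local-mode runs look roughly like the following (the seed directory and crawl path are placeholders for my setup, and the exact `bin/crawl` arguments vary a bit between Nutch versions):

```shell
# Local mode: the crawl script runs everything in a single JVM,
# no Hadoop cluster required.
bin/crawl -i -s urls/ crawl/ 2

# Deploy mode (what I'm aiming for): when invoked from runtime/deploy,
# the same script submits each crawl step as a MapReduce job to the
# Hadoop cluster instead of running it locally.
# runtime/deploy/bin/crawl -i -s urls/ crawl/ 2
```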
I now want to move on to distributed crawling using a public cloud, and
I wanted to know whether fellow users have any experience setting up
Nutch for distributed crawling.
From the Nutch wiki I have some idea of the hardware requirements.
What I would like to know is which public cloud providers (IaaS or PaaS)
are good for running Hadoop clusters: specifically, ones where the cluster
is easy to set up and manage, and which are easy on the budget.
Please share any insights based on your experience.