Nutch Crawl Vs. Merge Time Complexity


Hi there,

I have a couple of questions and would appreciate your help.

I'm fairly new to the nutch-dev mailing list, and I'm not quite sure what the appropriate way of getting involved with the Nutch development group is. Please let me know who I should contact with issues and questions about Nutch.

I've been using Nutch and customizing it so that the returned search results can be managed with paging on the web. I'm doing this for my company, and my supervisor has agreed to contribute the paging code to the Nutch community. Please guide me on how to proceed with this.

Finally, a technical question. I've been using Nutch v0.7 on our company Unix system, where it is set up to crawl our intranet sites daily for updates. I've tried using merge, dedup, updatedb, and so on, and I noticed that this approach took longer and was less efficient than simply doing a fresh crawl. For example, if I have two separate crawls of two different domains, such as hotmail and yahoo, what would the time complexity be for Nutch to crawl these two domains separately and then merge the results, compared to doing a single full crawl of both domains? My guess is that it would take Nutch about the same amount of time either way. If so, is there a reason to use merge at all? Please let me know what you think. I'm still trying to understand how Nutch behaves, and I don't mean to criticize anyone who has worked on the merge feature.


