removing site from webdb


removing site from webdb

waterwheel
We've got a site that is causing our crawl to slow dramatically, from
20 Mbit/s down to about 3 or 4. The basic problem is that the site seems
to consist of huge numbers of pages that aren't responding. We can
remove the site from the index, but we're having trouble removing it
permanently from the webdb so that we never fetch it again. Is there an
easy way in 0.7.1 to remove a site from the webdb, and then keep it
permanently removed?



Re: removing site from webdb

Rod Taylor-2
On Fri, 2006-03-17 at 13:44 -0500, Insurance Squared Inc. wrote:
> We've got a site that is causing our crawl to slow dramatically, from
> 20 Mbit/s down to about 3 or 4. The basic problem is that the site seems
> to consist of huge numbers of pages that aren't responding. We can
> remove the site from the index, but we're having trouble removing it
> permanently from the webdb so that we never fetch it again. Is there an
> easy way in 0.7.1 to remove a site from the webdb, and then keep it
> permanently removed?

You can add a filter on that domain to your regex-urlfilter.txt file, or
you can allow Nutch to churn through each URL and mark it as invalid
individually.
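
For example, a prohibit rule near the top of conf/regex-urlfilter.txt
would look something like this (badsite.example.com stands in for the
actual host; rules are tried in order and the first match wins):

  # skip every URL on the problem host
  -^http://([a-z0-9-]+\.)*badsite\.example\.com/

  # ... your existing rules follow, ending with the catch-all accept:
  +.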

This process can be done quite quickly if Nutch scales the number of
threads to achieve the best use of bandwidth.

Encourage the Nutch folks to apply this patch. I give it 50 Mbit/s and
Nutch will scale up to 500 threads per task if most threads are hitting
bad pages, or down to about 60 threads per task if they're downloading
large pages. In the end we stay within about 10% of the 50 Mbit/s target.

http://issues.apache.org/jira/browse/NUTCH-207
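
Without the patch, the thread count is static; in a stock 0.7.x install
you set it by hand in nutch-site.xml. A minimal fragment, assuming the
standard property names from nutch-default.xml (the values here are
illustrative only, not recommendations):

  <configuration>
    <property>
      <name>fetcher.threads.fetch</name>
      <value>100</value>  <!-- total fetcher threads per task -->
    </property>
    <property>
      <name>fetcher.threads.per.host</name>
      <value>1</value>    <!-- max concurrent requests to one host -->
    </property>
  </configuration>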


--
Rod Taylor <[hidden email]>


Re: removing site from webdb

kangas
In reply to this post by waterwheel
An easy way to do this for Nutch 0.7.1:
- Adjust regex-urlfilter.txt (as Rod mentioned), or some other
component of your URLFilter chain, to screen out the site
- Run my PruneDBTool to force all URLs in the webdb through the
URLFilter chain again

Code is here:
http://blog.busytonight.com/2006/03/nutch_07_prunedb_tool.html

(It won't work for 0.8. Hopefully it won't be necessary there, though.)
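
A run would look roughly like the sketch below; the class name and
argument are assumptions for illustration only, so check the blog post
above for the actual usage:

  # hypothetical invocation -- substitute the real class name and
  # webdb path from the blog post
  bin/nutch org.busytonight.nutch.PruneDBTool crawl/db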

--Matt

On Mar 17, 2006, at 1:44 PM, Insurance Squared Inc. wrote:

> We've got a site that is causing our crawl to slow dramatically,
> from 20 Mbit/s down to about 3 or 4. The basic problem is that the
> site seems to consist of huge numbers of pages that aren't
> responding. We can remove the site from the index, but we're having
> trouble removing it permanently from the webdb so that we never
> fetch it again. Is there an easy way in 0.7.1 to remove a site from
> the webdb, and then keep it permanently removed?

--
Matt Kangas / [hidden email]