Proposal for Avoiding Content Generation Sites

Proposal for Avoiding Content Generation Sites

Rod Taylor-2
We've indexed several content generation sites that we want to
eliminate. One had hundreds of thousands of sub-domains spread across
several domains (up to 50M pages in total). Quite annoying.

First is to allow for cleaning up.  This consists of a new option to
"updatedb" which can scrub the database of all URLs which no longer
match URLFilter settings (regex-urlfilter.txt). This allows a change in
the urlfilter to be reflected against Nutch's current dataset, something
I think others have asked for in the past.
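As an illustration of the sort of urlfilter change this would make retroactive, a rule like the following in regex-urlfilter.txt (spam-site.example is a placeholder) rejects an entire site and all of its sub-domains; the new updatedb option would then scrub every matching URL already in the database:

```
# Hypothetical rule: drop a content generation site and all of its sub-domains
-^http://([a-z0-9.-]*\.)?spam-site\.example/
```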

Second is to treat a subdomain as being in the same bucket as the domain
for the generator.  This means that *.domain.com and *.domain.co.uk would
create two buckets instead of one per hostname. There is a high
likelihood that sub-domains will be on the same servers as the primary
domain and should be rate-limited as such.  generate.max.per.host would
then work more like generate.max.per.domain.
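A sketch of the hostname normalization this implies (the two-part-suffix set below is a hypothetical stand-in; a real implementation would need a full public-suffix list):

```java
import java.util.Set;

public class DomainBucket {

  // Hypothetical, incomplete set of two-part public suffixes; a real
  // implementation would consult a complete public-suffix list.
  private static final Set<String> TWO_PART_SUFFIXES =
      Set.of("co.uk", "co.jp", "com.au", "co.nz");

  /** Reduce a hostname to the registered domain used as the generator bucket. */
  public static String bucketFor(String host) {
    String[] labels = host.toLowerCase().split("\\.");
    if (labels.length <= 2) {
      return host.toLowerCase();
    }
    String lastTwo = labels[labels.length - 2] + "." + labels[labels.length - 1];
    int keep = TWO_PART_SUFFIXES.contains(lastTwo) ? 3 : 2;
    StringBuilder sb = new StringBuilder();
    for (int i = labels.length - keep; i < labels.length; i++) {
      if (sb.length() > 0) sb.append('.');
      sb.append(labels[i]);
    }
    return sb.toString();
  }
}
```

With this, foo.domain.com, bar.domain.com and baz.domain.com all land in the single bucket domain.com, while www.domain.co.uk keeps its three-label registered domain.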


Third is ongoing detection. I would like to add a feature to Nutch which
could report anomalies during updatedb or generate. For example, if any
given domain.com bucket during generate is found to have more than 5000
URLs to be downloaded, it should be flagged for manual review: write a
record to a text file which can be picked up by a person to confirm that
we haven't wandered into a garbage content generation site. If we have,
the person would add that domain to the urlfilter and the next updatedb
would clean out all URLs from that location.
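A minimal sketch of the counting side of such a check (the threshold and the review-file handling are placeholders; in Nutch itself this would hang off the generate job):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class AnomalyCheck {

  /** Return, sorted, the domains whose URL count exceeds the threshold. */
  public static List<String> flagOversized(List<String> domains, int threshold) {
    Map<String, Integer> counts = new HashMap<>();
    for (String d : domains) {
      counts.merge(d, 1, Integer::sum);
    }
    // In Nutch these would be written to a review file during generate,
    // for a person to confirm before the next updatedb.
    return counts.entrySet().stream()
        .filter(e -> e.getValue() > threshold)
        .map(Map.Entry::getKey)
        .sorted()
        .collect(Collectors.toList());
  }
}
```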


Are there any thoughts or objections to this? One and two are pretty
straightforward. Detection is not so easy.

--
Rod Taylor <[hidden email]>


Re: Proposal for Avoiding Content Generation Sites

kangas
Hi Rod,

Re (1): I have a "prunedb" tool written for 0.7. I'd be happy to
contribute it, but it is fairly trivial (~100 lines), and it won't work
for 0.8.

Detection as you proposed would be great for catching breadth-traps,  
but it probably should be pluggable. A standalone tool that the user  
runs before "generate" would work, and could be easily swapped out.

BTW: I have an initial solution for depth-traps implemented for 0.7. I
haven't had time to post the code anywhere yet, but if you're
interested, let me know.

--Matt

On Mar 8, 2006, at 12:27 PM, Rod Taylor wrote:

> [original proposal quoted in full; snipped]

--
Matt Kangas / [hidden email]



RE: Proposal for Avoiding Content Generation Sites

Gal Nitzan
In reply to this post by Rod Taylor-2
Actually there is a property in conf: generate.max.per.host

So if you add a message in Generator.java at the appropriate place... you
have what you wish...

Gal


-----Original Message-----
From: Rod Taylor [mailto:[hidden email]]
Sent: Wednesday, March 08, 2006 7:28 PM
To: Nutch Developer List
Subject: Proposal for Avoiding Content Generation Sites

[original proposal quoted in full; snipped]




RE: Proposal for Avoiding Content Generation Sites

Rod Taylor-2
On Thu, 2006-03-09 at 21:51 +0200, Gal Nitzan wrote:
> Actually there is a property in conf: generate.max.per.host

That has proven to be problematic.

foo.domain.com
bar.domain.com
baz.domain.com
*** Repeat up to 4 Million times for some content generator sites ***

Each of these gets a different slot which effectively stalls everything
else.

Are there any objections to changing this to be one bucket per domain
instead of one per hostname?

> So if you add a message in Generator.java at the appropriate place... you
> have what you wish...


> [forwarded original message snipped]
--
Rod Taylor <[hidden email]>


Re: Proposal for Avoiding Content Generation Sites

Doug Cutting
In reply to this post by Rod Taylor-2
Rod Taylor wrote:
> First is to allow for cleaning up.  This consists of a new option to
> "updatedb" which can scrub the database of all URLs which no longer
> match URLFilter settings (regex-urlfilter.txt). This allows a change in
> the urlfilter to be reflected against Nutch's current dataset, something
> I think others have asked for in the past.

Yes, this would be a welcome addition.  Note that Andrzej recently
committed a change that causes Generate to filter URLs, which achieves
the same effect but without removing them from the database, so they're
still consuming space and time.

> Second is to treat a subdomain as being in the same bucket as the domain
> for the generator.  This means that *.domain.com and *.domain.co.uk would
> create two buckets instead of one per hostname. There is a high
> likelihood that sub-domains will be on the same servers as the primary
> domain and should be rate-limited as such.  generate.max.per.host would
> then work more like generate.max.per.domain.

This could be implemented by adding a new plugin extension point for
hostname normalization.  The default implementation would be a no-op.

> Third is ongoing detection. I would like to add a feature to Nutch which
> could report anomalies during updatedb or generate. For example, if any
> given domain.com bucket during generate is found to have more than 5000
> URLs to be downloaded, it should be flagged for manual review: write a
> record to a text file which can be picked up by a person to confirm
> that we haven't wandered into a garbage content generation site.

A simple way to implement this would be to have the generator log each
host that exceeds the limit.  Then you can simply grep the logs for
these messages.  Good enough?

Doug

Re: Proposal for Avoiding Content Generation Sites

Rod Taylor-2
On Thu, 2006-03-09 at 12:09 -0800, Doug Cutting wrote:

> Rod Taylor wrote:
> > First is to allow for cleaning up. [...]
>
> Yes, this would be a welcome addition.  Note that Andrzej recently
> committed a change that causes Generate to filter urls, which achieves
> the same effect, but w/o removing them from the database, so they're
> still consuming space & time.

Excellent. I'll put someone on this.

> > Second is to treat a subdomain as being in the same bucket as the
> > domain for the generator. [...]
>
> This could be implemented by adding a new plugin extension point for
> hostname normalization.  The default implementation would be a no-op.

Reasonable enough.

> > Third is ongoing detection. [...]
>
> A simple way to implement this would be to have the generator log each
> host that exceeds the limit.  Then you can simply grep the logs for
> these messages.  Good enough?

Good enough.

Thanks for the hints at direction.

--
Rod Taylor <[hidden email]>


Re: Proposal for Avoiding Content Generation Sites

Andrzej Białecki-2
Rod Taylor wrote:

> On Thu, 2006-03-09 at 12:09 -0800, Doug Cutting wrote:
>  
>> Rod Taylor wrote:
>>> First is to allow for cleaning up. [...]
>> Yes, this would be a welcome addition.  Note that Andrzej recently
>> committed a change that causes Generate to filter urls, which achieves
>> the same effect, but w/o removing them from the database, so they're
>> still consuming space & time.
>>    
>
> Excellent. I'll put someone on this.
>  

Stefan submitted some code that from my cursory glance looks good -
could you please check NUTCH-226 and see if it works for you? I plan to
run some tests and add this (plus a companion LinkDBFilter).

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Proposal for Avoiding Content Generation Sites

Rod Taylor-2
> > Excellent. I'll put someone on this.

> Stefan submitted some code that from my cursory glance looks good -
> could you please check NUTCH-226 and see if it works for you? I plan to
> run some tests and add this (plus a companion LinkDBFilter).

I have about 100M URLs to expunge and I want them completely gone.
Changing the status doesn't count from a performance perspective: Nutch
spends a significant amount of time in sorts and other logic on this
garbage during both generate and updatedb in every cycle.

Doing the actual expunging during updatedb is better for performance
than a separate command. As a periodic option (scrubbing content
generation or abuse sites, in my case), combining it with updatedb will
reduce the IO and CPU requirements. Updatedb already reads in the DB,
cycles through every entry, sorts it, and writes it out.


Doing this in a separate command would kill 4 to 8 hours of otherwise
usable time. Doing it as part of updatedb probably costs about 1 hour
of work (CPU time to apply the filters only).
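A sketch of the filtering step this would fold into the updatedb pass (the rule format mirrors regex-urlfilter.txt; the rules shown are placeholders, and the first-match-wins behavior is an assumption about the filter semantics):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class UrlScrubber {

  // Ordered rules: true = accept ("+regex"), false = reject ("-regex").
  private final List<Map.Entry<Boolean, Pattern>> rules = new ArrayList<>();

  /** Rules in regex-urlfilter.txt style: "+regex" accepts, "-regex" rejects. */
  public UrlScrubber(List<String> ruleLines) {
    for (String line : ruleLines) {
      rules.add(Map.entry(line.startsWith("+"),
                          Pattern.compile(line.substring(1))));
    }
  }

  /** First matching rule wins; a URL matching no rule is dropped. */
  public boolean keep(String url) {
    for (Map.Entry<Boolean, Pattern> rule : rules) {
      if (rule.getValue().matcher(url).find()) {
        return rule.getKey();
      }
    }
    return false;
  }

  /** During the updatedb pass, entries failing the filter are simply dropped. */
  public List<String> scrub(List<String> crawlDbUrls) {
    List<String> kept = new ArrayList<>();
    for (String url : crawlDbUrls) {
      if (keep(url)) {
        kept.add(url);
      }
    }
    return kept;
  }
}
```

Because the check piggybacks on a pass that already touches every entry, the only extra cost is the regex matching itself, which matches the estimate above.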

--
Rod Taylor <[hidden email]>


Re: Proposal for Avoiding Content Generation Sites

Andrzej Białecki-2
Rod Taylor wrote:

> Doing the actual expunging during updatedb is better than as a separate
> command for performance. As a periodic option (scrubbing content
> generation or abuse sites in my case) combining with updatedb will
> reduce the IO and CPU requirements. Updatedb already reads in the DB,
> cycles through every entry, sorts it, and write it out.
>
>
> Doing this in a separate command would kill 4 to 8 hours of otherwise
> usable time. Doing it as a part of updatedb probably costs about 1 hour
> of work (CPU time to apply filters only).
>  

That's true. A separate tool might be useful anyway, but if you have
some spare cycles and could provide a patch to updatedb that implements
this, it would be a great addition.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



RE: Proposal for Avoiding Content Generation Sites

kkrugler
In reply to this post by Rod Taylor-2
>On Thu, 2006-03-09 at 21:51 +0200, Gal Nitzan wrote:
>>  Actually there is a property in conf: generate.max.per.host
>
>That has proven to be problematic.
>
>foo.domain.com
>bar.domain.com
>baz.domain.com
>*** Repeat up to 4 Million times for some content generator sites ***
>
>Each of these gets a different slot which effectively stalls everything
>else.
>
>Are there any objections to changing this to be one bucket per domain
>instead of one per hostname?

That sounds like a good idea.

From what I remember when we did this, deriving the base domain for a
URL is a bit of a fuzzy problem: things like language-code suffixes,
shortened versions of .com combined with country codes (.co.jp), etc.

Eventually we shifted to resolving domains to IP addresses. I think
there's been discussion of that on this list previously, to help
ensure threads on different TaskTracker nodes don't hit the same
server at the same time.
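The IP-based grouping described above could be sketched like this (the resolver is injected so the grouping logic stays testable; a real crawler would back it with DNS lookups such as InetAddress.getByName, plus caching):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class IpBuckets {

  /** Group hostnames by the address they resolve to. */
  public static Map<String, List<String>> groupByAddress(
      List<String> hosts, Function<String, String> resolver) {
    Map<String, List<String>> buckets = new HashMap<>();
    for (String host : hosts) {
      // Sub-domains of a content generator often resolve to one address,
      // so they all collapse into a single politeness bucket here.
      buckets.computeIfAbsent(resolver.apply(host), k -> new ArrayList<>())
          .add(host);
    }
    return buckets;
  }
}
```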

For the cases you've run into, do they resolve down to a limited
number of unique IP addresses?

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"