Nutch spider trap detection

brainstorm-2-2
Hi!

I guess it is implemented, but I cannot find it in the Nutch API
docs or on the wiki :-/ ... Is there any mechanism implemented in Nutch to
detect spider traps [1]?

Thanks,
Roman

[1] http://en.wikipedia.org/wiki/Spider_trap

Re: Nutch spider trap detection

Dennis Kubes-2
There are some regexes in the url normalizers and there is some code in
DomContentUtils for recursion.

Dennis

Re: Nutch spider trap detection

brainstorm-2-2
Thanks! I guess you mean:

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

In conf/regex-urlfilter.txt, am I wrong?
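
To see what that rule actually catches, here is a minimal Python sketch (not Nutch code). It assumes the filter applies the pattern as an unanchored search and that the leading "-" only marks the rule as an exclusion; the helper name looks_like_loop and the example.com URLs are made up for illustration.

```python
import re

# Pattern copied from the rule above, without the leading "-"
# (the "-" marks the line as an exclusion rule, it is not part of the regex).
LOOP_PATTERN = re.compile(r".*(/[^/]+)/[^/]+\1/[^/]+\1/")


def looks_like_loop(url: str) -> bool:
    """Hypothetical helper: True if the URL would hit the exclusion rule."""
    # Assumption: the filter searches anywhere in the URL rather than
    # requiring the whole URL to match.
    return LOOP_PATTERN.search(url) is not None


if __name__ == "__main__":
    samples = [
        "http://example.com/a/b/a/c/a/d",
        "http://example.com/cal/2008/cal/2008/cal/2008/",
        "http://example.com/a/b/c/d/",
    ]
    for url in samples:
        verdict = "skip" if looks_like_loop(url) else "keep"
        print(verdict, url)
```

As written, the pattern needs the same segment at every other position (/a/x/a/y/a/), so a URL such as /a/a/a/ does not trigger it, while /x/x/x/x/x/ does.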

The DomContentUtils code under
/nutch/trunk/src/java/org/apache/nutch/parse/*.java is a bit confusing
to me, and I cannot see the recursion "protection" code.

Thanks!

On Mon, Jun 30, 2008 at 12:21 AM, Dennis Kubes <[hidden email]> wrote:

> There are some regexes in the url normalizers and there is some code in
> DomContentUtils for recursion.
>
> Dennis