How to skip hidden URLs inside a div tag?

Massimo Miccoli
Hi Nutch devs,

After fetching about 100 million pages, I see many search engine spammers
using a hidden div tag (positioned at negative coordinates) to include many
URLs that users don't see when they access the page. These links distort the
boost (computed from the inlink count), so I want to skip these URLs.
How can I do that?

Thanks,

Massimo

Re: How to skip hidden URLs inside a div tag?

Andrzej Białecki-2
Massimo Miccoli wrote:
> Hi Nutch devs,
>
> After fetching about 100 million pages, I see many search engine spammers
> using a hidden div tag (positioned at negative coordinates) to include many
> URLs that users don't see when they access the page. These links distort the
> boost (computed from the inlink count), so I want to skip these URLs.
> How can I do that?

Implement an HtmlParseFilter, similar to the creativecommons plugin. Your
plugin would remove the matching tags.
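
As a rough illustration of the tag-removal step such a filter could perform,
here is a minimal sketch in plain Java DOM code. It is not Nutch code: the
class name HiddenDivStripper, the parse helper, and the "negative offset in an
inline style" heuristic are all assumptions made for this example. It only
sees inline style attributes, not rules from external stylesheets, and wiring
it into an actual HtmlParseFilter is left out.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper: drops div elements whose inline style uses a negative offset. */
class HiddenDivStripper {
    // Assumed heuristic: "left", "top" or "text-indent" followed by a negative number.
    private static final Pattern NEGATIVE_OFFSET =
        Pattern.compile("(?i)\\b(?:left|top|text-indent)\\s*:\\s*-\\s*\\d");

    /** Removes every matching div (with its whole subtree) and returns how many were removed. */
    static int strip(Node root) {
        List<Node> doomed = new ArrayList<>();
        collect(root, doomed);
        for (Node n : doomed) {
            if (n.getParentNode() != null) {
                n.getParentNode().removeChild(n);
            }
        }
        return doomed.size();
    }

    // Collect first, remove later, so the tree is not modified while it is being walked.
    private static void collect(Node node, List<Node> doomed) {
        if (node instanceof Element) {
            Element e = (Element) node;
            // getAttribute returns "" when the attribute is absent.
            if ("div".equalsIgnoreCase(e.getTagName())
                    && NEGATIVE_OFFSET.matcher(e.getAttribute("style")).find()) {
                doomed.add(e);
                return; // the subtree goes away with its parent
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), doomed);
        }
    }

    /** Demo-only helper: parses well-formed markup into a DOM document. */
    static Document parse(String markup) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><body>"
            + "<div style=\"position:absolute; left:-9999px\"><a href=\"http://spam.example/\">x</a></div>"
            + "<div><a href=\"http://ok.example/\">y</a></div>"
            + "</body></html>");
        System.out.println("removed divs: " + strip(doc));
        System.out.println("links left: " + doc.getElementsByTagName("a").getLength());
    }
}
```

Removing the whole div before outlinks are extracted is what keeps the hidden
links from being counted; a filter that runs after outlink extraction would
have to prune the outlink list instead.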

In fact, if you have some spare cycles, you could implement a more
generic "html cleanup" plugin, where you could specify a list of XPaths
to match (and optionally replace).
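
The core of such a generic cleanup could look roughly like the sketch below,
which uses only the standard javax.xml.xpath API. The class name XPathCleanup,
the method removeMatches, and the sample expressions are hypothetical; a real
plugin would read the XPath list from its configuration and would need to
match the element-name case of the DOM that the HTML parser produces. The
optional "replace" part is not covered here.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper: removes every node matched by a list of XPath expressions. */
class XPathCleanup {
    /** Evaluates each expression relative to root, detaches the matches, returns the count. */
    static int removeMatches(Node root, List<String> xpaths) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        int removed = 0;
        for (String expr : xpaths) {
            NodeList hits;
            try {
                hits = (NodeList) xpath.evaluate(expr, root, XPathConstants.NODESET);
            } catch (XPathExpressionException e) {
                throw new IllegalArgumentException("Bad XPath: " + expr, e);
            }
            for (int i = 0; i < hits.getLength(); i++) {
                Node n = hits.item(i);
                // Attribute nodes have no parent node and are skipped.
                if (n.getParentNode() != null) {
                    n.getParentNode().removeChild(n);
                    removed++;
                }
            }
        }
        return removed;
    }

    /** Demo-only helper: parses well-formed markup into a DOM document. */
    static Document parse(String markup) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><body>"
            + "<div style=\"left:-9999px\"><a href=\"http://spam.example/\">x</a></div>"
            + "<p><a href=\"http://ok.example/\">y</a></p>"
            + "<script>1</script>"
            + "</body></html>");
        // Sample expressions (assumptions): hidden-looking divs and all script elements.
        List<String> rules = Arrays.asList(
            ".//div[contains(@style, 'left:-')]",
            ".//script");
        System.out.println("removed nodes: " + removeMatches(doc, rules));
    }
}
```

XPath 1.0 has no regular expressions, so a rule such as contains(@style,
'left:-') misses variants like "left: -10px"; a list of XPaths trades some
precision for being configurable without code changes.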

--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com