Duplicate Content Issues


Duplicate Content Issues

Jack.Tang
Hi

How to avoid duplicate content?
1. Mirror sites: 1 website, 2 domains.
2. Confusing the bot: dynamic URLs. As robots crawl dynamic content,
the site may return a different URL with the same content…
3. Print friendly pages?

Will Nutch enhance the dedup code?
/Jack
--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: Duplicate Content Issues

Jérôme Charron
> How to avoid duplicate content?

You can use the org.apache.nutch.crawl.TextProfileSignature implementation
instead of the default MD5Signature or provide your own Signature
implementation.
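
To illustrate why a text-profile signature catches more duplicates than a hash of the raw content, here is a minimal standalone sketch. It is not Nutch's TextProfileSignature code and does not use the Nutch Signature API; the class name ProfileSignatureSketch, the helper methods, and the "longer than 2 characters" token cutoff are all assumptions made for illustration. The idea shown is only this: hash a normalized word-frequency profile instead of the exact bytes, so that pages differing in case, whitespace, or punctuation get the same signature.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; not part of Nutch.
class ProfileSignatureSketch {

    // Hex-encoded MD5 of the exact input string (stands in for a raw-content hash).
    static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Hash of a normalized word-frequency profile of the text.
    static String profileSignature(String text) {
        // TreeMap keeps tokens sorted, so the profile does not depend on word order.
        Map<String, Integer> counts = new TreeMap<>();
        // Lowercase, then split on anything that is not a letter or digit.
        for (String tok : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            // Assumed cutoff: ignore empty and very short tokens.
            if (tok.length() > 2) {
                counts.merge(tok, 1, Integer::sum);
            }
        }
        StringBuilder profile = new StringBuilder();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            profile.append(e.getKey()).append(' ').append(e.getValue()).append('\n');
        }
        return md5Hex(profile.toString());
    }

    public static void main(String[] args) {
        String page = "Nutch crawls the web.";
        String printFriendly = "  NUTCH crawls   the web!  ";
        // The raw hashes differ, while the profile signatures match.
        System.out.println(md5Hex(page).equals(md5Hex(printFriendly)));
        System.out.println(profileSignature(page).equals(profileSignature(printFriendly)));
    }
}
```

In Nutch itself the signature class is, as far as I know, selected with the db.signature.class property in conf/nutch-site.xml; check nutch-default.xml in your version before relying on that name.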

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/