description of db.ignore.internal.links property

Vineet Garg-3
Hi,

What does the db.ignore.internal.links property in nutch-default.xml do?

<property>
  <name>db.ignore.internal.links</name>
  <value>true</value>
  <description>If true, when adding new links to a page, links from
  the same host are ignored.  This is an effective way to limit the
  size of the link database, keeping only the highest quality
  links.
  </description>
</property>

1. Does it affect the page rank by taking more pages into account when it computes the page rank, or
2. Does it affect indexing by indexing more pages, and therefore return more results when searching
later on?


Can anybody please explain it?


Regards,
Vineet Garg

Re: description of db.ignore.internal.links property

Dennis Kubes-2


Vineet Garg wrote:

> Hi,
>
> What does the db.ignore.internal.links property in nutch-default.xml do?
>
> <property>
>  <name>db.ignore.internal.links</name>
>  <value>true</value>
>  <description>If true, when adding new links to a page, links from
>  the same host are ignored.  This is an effective way to limit the
>  size of the link database, keeping only the highest quality
>  links.
>  </description>
> </property>
>

If true, it will NOT store links from a page to another page on the
same domain.  For example, a link on www.domain.com/a.html that points
to www.domain.com/b.html is dropped.  This significantly decreases the
number of links being stored in the link database.

> 1. Does it affect the page rank by taking more pages into account when
> it computes the page rank, or

Yes, because by default internal links are scored the same as external
links.  For large web crawls this will throw off results, because pages
with more internal links can get higher rankings.  I have found that on
larger web crawls it is best to ignore internal links and to set
db.score.link.internal to a very low value or 0.
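
As a rough sketch of that setup: the fragment below assumes local
overrides go in conf/nutch-site.xml (the usual place to override
nutch-default.xml) and that your Nutch version honors both properties.
The value 0.0 is just the "or 0" case mentioned above, not a tuned
number.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml (assumed location for local overrides) -->
<configuration>

  <!-- Do not store same-site links in the linkdb. -->
  <property>
    <name>db.ignore.internal.links</name>
    <value>true</value>
  </property>

  <!-- Give internal links little or no weight in scoring.
       0.0 is an illustrative choice; a very low positive value
       is the other option mentioned in this thread. -->
  <property>
    <name>db.score.link.internal</name>
    <value>0.0</value>
  </property>

</configuration>
```

For a small single-site or intranet crawl, where nearly every link is
internal, you may want db.ignore.internal.links set to false instead so
the linkdb is not left almost empty.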

> 2. Does it affect indexing by indexing more pages, and therefore return more
> results when searching later on?

No, it doesn't affect the links being stored in crawldb and later
fetched.  It only affects linkdb and the eventual scoring process.

Dennis