Working with the Link database

Kevin MacDonald-3
I am trying to extract all links from a given web page. I am crawling
www.brick.com as an example by doing the following:
./bin/nutch -core crawl bin/urls -dir brickdata -depth 1

This works fine, but when I dump the links like so:

./bin/nutch -core readlinkdb brickdata/linkdb -dump bricklinksdump

what I get are only 'InLinks'. There is a link on that page to
'www.americantile.com'. I am not expecting Nutch to crawl that page because I
have only set a depth of 1 and because my crawl-urlfilter is set to
+^http://([a-z0-9]*\.)*brick.com/

but I was expecting to see 'www.americantile.com' in my dump of the link
database. My understanding was that, regardless of how the scope of the crawl
is limited, external links would still appear in the link database. Is there
a configuration change I need to make to allow this to happen?
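
For reference, here is a minimal sketch of what the crawl-urlfilter rule quoted above accepts and rejects. It uses Python's re module as a stand-in for the Java regex engine Nutch uses, which is an assumption that holds for a simple pattern like this one. The helper name accepted is hypothetical, and the sketch only tests the pattern itself; it does not show at which stages Nutch applies the filter.

```python
import re

# Pattern copied from the crawl-urlfilter rule, without the leading '+'
# (in the filter file, '+' marks an include rule).
PATTERN = re.compile(r"^http://([a-z0-9]*\.)*brick.com/")


def accepted(url: str) -> bool:
    """Return True if the include rule matches the URL (hypothetical helper)."""
    return PATTERN.search(url) is not None


for url in ("http://www.brick.com/", "http://www.americantile.com/"):
    print(url, "accepted" if accepted(url) else "rejected")
```

Under these assumptions the rule accepts the brick.com URL and rejects the americantile.com one, so any step that runs URLs through this filter would drop the external link. Note also that the dot in brick.com is unescaped, so it matches any character there.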

Thanks!

Kevin