Outlinks not being processed

4 messages

Outlinks not being processed

Kevin MacDonald-3
I am virtually certain that no outlinks exist in the link database following
a crawl. When crawling a site such as "foo.com" I wind up with all InLinks
there in the db, but OutLinks (any link leading off the foo.com domain) do
not appear in the database. My configuration settings are basically this:

nutch-default.xml
'db.ignore.external.links' is set to false
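(As an aside: local overrides conventionally go in nutch-site.xml rather than editing nutch-default.xml directly, since nutch-default.xml is replaced on upgrade. A sketch of the override block:)

```xml
<!-- nutch-site.xml: override of the default; keep outlinks to external hosts -->
<property>
  <name>db.ignore.external.links</name>
  <value>false</value>
  <description>If false, outlinks pointing at other hosts are not discarded.</description>
</property>
```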

crawl-urlfilter.xml
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept hosts in foo.com
+^http://([a-z0-9]*\.)*foo.com/

# skip everything else
-.
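(For reference, a quick sanity check of what those filter rules accept. This is not Nutch itself, just the accept pattern tried against a few hypothetical URLs; the regex constructs used here behave the same in Java and Python.)

```python
import re

# the accept rule from crawl-urlfilter.xml, as written above
accept = re.compile(r'^http://([a-z0-9]*\.)*foo.com/')

# hypothetical example URLs
for url in ["http://foo.com/index.html",
            "http://www.foo.com/a",
            "http://bar.com/"]:
    print(url, bool(accept.match(url)))
```

Only the foo.com URLs pass; anything off-domain falls through to the final `-.` rule and is rejected.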

The crawl is done using

bin/nutch crawl urls -dir crawl -depth 3 -topN 50

The link dump is done using

bin/nutch readlinkdb crawl/linkdb -dump links

If anyone can help me resolve this issue I would very much appreciate it.

Thanks,

Kevin

Re: Outlinks not being processed

Amitabha Banerjee
I believe +^http://([a-z0-9]*\.)*foo.com/ is filtering out all URLs except
those in the foo.com domain.


Re: Outlinks not being processed

Kevin MacDonald-3
That prevents crawling of URLs outside the foo domain, but should NOT
prevent storing of links in the database. That is my understanding from
reading this:
http://facstaff.unca.edu/mcmcclur/class/Seminar/Pagerank/nutch/nutch.html.




Re: Outlinks not being processed

Kevin MacDonald-3
It appears, however, that you are correct and the document I was reading was
incorrect. I was too trusting. I am now using:
+^http://([a-z0-9]*\.)*\S*/

and that appears to work. Thanks for putting me on the right track.
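(Same kind of sanity check as before, against hypothetical URLs: the relaxed pattern accepts URLs on any host, so outlinks survive the filter.)

```python
import re

# the relaxed accept rule: any http URL with a host followed by a slash
relaxed = re.compile(r'^http://([a-z0-9]*\.)*\S*/')

print(bool(relaxed.match("http://foo.com/page")))
print(bool(relaxed.match("http://external.example/")))
```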

Kevin
