relative urls


relative urls

Edward Quick

It looks to me like Nutch doesn't handle pages with relative links. I have checked the FAQ and set db.max.outlinks.per.page to -1, but that makes no difference in my case.

<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>
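
The rule in that description can be sketched in a few lines. This is an illustrative Python snippet, not Nutch's own code; the helper name `limit_outlinks` and the sample links are hypothetical.

```python
def limit_outlinks(outlinks, max_outlinks):
    """Apply the rule from the property description (hypothetical helper)."""
    if max_outlinks >= 0:
        # Nonnegative limit: process at most max_outlinks links.
        return list(outlinks[:max_outlinks])
    # Negative limit (e.g. -1): process every outlink.
    return list(outlinks)


links = ["/a", "/b", "/c"]
print(limit_outlinks(links, 2))   # capped at two links
print(limit_outlinks(links, -1))  # no cap
```

Note that a value of 0 counts as nonnegative, so it would drop every outlink rather than disable the limit.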


Here's an example of a relative URL on my intranet home page:
<a class=cbl1 href="/general/apps/feedback.nsf/$Control/view+Feedback+-+By+Date">View by date</a>

Is there something I should configure to handle these?

Thanks for any help.

Ed.





RE: relative urls

Edward Quick

Looks like my theory was wrong here - sorry. Nutch is parsing relative links; it just hasn't crawled the page I was checking yet.
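
For reference, resolving a root-relative href against the page it appears on can be sketched as below. This uses Python's standard `urljoin` purely as an illustration of the general resolution rule, not Nutch's parser; the base URL is a made-up placeholder, since the real intranet host is not given in this thread.

```python
from urllib.parse import urljoin

# Hypothetical base URL; the real intranet host is not given in the thread.
base = "http://intranet.example.com/index.html"

# The href from the example link in the first message.
href = "/general/apps/feedback.nsf/$Control/view+Feedback+-+By+Date"

# A leading "/" replaces the whole path of the base URL but keeps
# its scheme and host.
absolute = urljoin(base, href)
print(absolute)
```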

> From: [hidden email]
> To: [hidden email]
> Subject: relative urls
> Date: Wed, 10 Sep 2008 10:53:55 +0000

RE: relative urls

Edward Quick

I worked out why. It was the line in crawl-urlfilter.txt/regex-urlfilter.txt containing:

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

I think it would be better to comment this out in the 'out of the box' version of Nutch, so the user gets everything first and then adds the filter if they need it. This filter cut out literally thousands of URLs for me.

Ed.
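
A rough sketch of how such a rule file behaves, written in Python for illustration only. It assumes that rules are tried in file order, that the first pattern found anywhere in the URL decides (`-` rejects, `+` accepts), and that a URL matching no rule is rejected. The catch-all accept rule and the example.com URLs are assumptions, not taken from this thread.

```python
import re

# Rules in file order as (sign, pattern): "-" rejects, "+" accepts.
RULES = [
    ("-", re.compile(r"[?*!@=]")),  # the skip line quoted above
    ("+", re.compile(r".")),        # assumed catch-all accept rule
]


def accepts(url, rules=RULES):
    """Return True if the first matching rule is an accept rule."""
    for sign, pattern in rules:
        if pattern.search(url):
            return sign == "+"
    return False  # assumption: unmatched URLs are rejected


plain = ("http://intranet.example.com/general/apps/feedback.nsf/"
         "$Control/view+Feedback+-+By+Date")
query = "http://intranet.example.com/search?q=nutch"

print(accepts(plain))  # contains none of ? * ! @ =
print(accepts(query))  # contains "?" and "="
```

Under these assumptions a URL like the href from the first message is not rejected by this rule on its own, while any URL carrying a query string is.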


> From: [hidden email]
> To: [hidden email]
> Subject: RE: relative urls
> Date: Wed, 10 Sep 2008 15:43:08 +0000

Re: relative urls

Kevin MacDonald-3
I agree. That line was also preventing Nutch from following redirects in
certain cases.

Kevin

On Wed, Sep 10, 2008 at 9:05 AM, Edward Quick <[hidden email]> wrote:

>
> I worked out why. It was the crawl-urlfilter.txt/regex-urlfilter.txt line
> containing:
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> I think it's better to comment this out in the 'out of the box' version of
> nutch so the user gets everything first, then adds on the filter if they
> need it. This filter cut out literally thousands of urls for me.
>
> Ed.

Re: relative urls

Doğacan Güney-3
In reply to this post by Edward Quick
On Wed, Sep 10, 2008 at 7:05 PM, Edward Quick <[hidden email]> wrote:

>
> I worked out why. It was the crawl-urlfilter.txt/regex-urlfilter.txt line
> containing:
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> I think it's better to comment this out in the 'out of the box' version of
> nutch so the user gets everything first, then adds on the filter if they
> need it. This filter cut out literally thousands of urls for me.
>

Can you create a JIRA issue for this? It would be better to discuss it
there...


--
Doğacan Güney

Re: relative urls

Andrzej Białecki-2
Doğacan Güney wrote:

> On Wed, Sep 10, 2008 at 7:05 PM, Edward Quick <[hidden email]> wrote:
>
>> I worked out why. It was the crawl-urlfilter.txt/regex-urlfilter.txt line
>> containing:
>>
>> # skip URLs containing certain characters as probable queries, etc.
>> -[?*!@=]
>>
>> I think it's better to comment this out in the 'out of the box' version of
>> nutch so the user gets everything first, then adds on the filter if they
>> need it. This filter cut out literally thousands of urls for me.
>>
>
> Can you create a JIRA issue for this? It would be better to discuss it
> there...

Please do. The reason this filter is active out of the box is primarily
to protect new users from collecting an infinite number of links from
sites that expose large databases through a web interface; such sites
usually use meta-characters like these to select particular records.
Common spider traps (calendars, search pages, forums, etc.) use them
too. Since Nutch doesn't have any spider-trap detection mechanism (yet),
this is a crude way to control the problem.
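
To make the spider-trap case concrete, here is a small Python sketch of a calendar whose "next month" link never runs out. The site, URL layout and function name are hypothetical; only the character class comes from the filter line discussed in this thread.

```python
import re
from itertools import islice

SKIP = re.compile(r"[?*!@=]")  # character class from the default filter


def calendar_links(year=2008, month=9):
    """Yield an endless chain of 'next month' links (hypothetical site)."""
    while True:
        yield f"http://www.example.com/calendar?year={year}&month={month}"
        month += 1
        if month > 12:
            year, month = year + 1, 1


# Take only the first six links; the generator itself never ends.
sample = list(islice(calendar_links(), 6))
for url in sample:
    print(url, "skipped" if SKIP.search(url) else "kept")
```

Every generated link carries `?` and `=`, so the single skip rule stops the crawl from following the chain, at the cost of also dropping legitimate query URLs.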


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com