nutch crawls only relative links


Denis Pimenov
Hello,

I am a newbie to Nutch. It seems to me that crawling does not follow
relative URLs by default. How can I fix this?

For example, my start page has the relative link <a href="/test/my.jsp">,
which is not crawled (although browsers open it with the proper prefix).
If I use the link <a href="http://mydomain.com:8080/test/my.jsp"> instead,
it is crawled fine. Is there a configuration file or some other setting
that fixes this? I have seen this question in the mail archive, but it
was not answered.

Denis Pimenov
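
For reference, the resolution that a browser performs on the relative
link above can be reproduced with java.net.URI.resolve from the JDK.
This is only a sketch: the start page address
http://mydomain.com:8080/index.html is an assumption (the post gives the
host and port but not the page name), and ResolveSketch is a
hypothetical class name, not part of Nutch.

```java
import java.net.URI;

class ResolveSketch {
    // Resolves an href found on a page against that page's own URL.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        // Assumed start page; only the host and port come from the post.
        String page = "http://mydomain.com:8080/index.html";
        System.out.println(resolve(page, "/test/my.jsp"));
        System.out.println(resolve(page, "http://mydomain.com:8080/test/my.jsp"));
    }
}
```

Both calls print http://mydomain.com:8080/test/my.jsp, so a crawler that
resolves each href against the page URL should end up with the same
target for the relative and the absolute form.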

Re: nutch crawls only relative links

Denis Pimenov
Denis Pimenov wrote:

I tried +^.* in crawl-urlfilter.txt, but it does not help: Nutch still
crawls only absolute links, not relative ones.

> Hello,
>
> I am a newbie to Nutch. It seems to me that crawling does not follow
> relative URLs by default. How can I fix this?
>
> For example, my start page has the relative link
> <a href="/test/my.jsp">, which is not crawled (although browsers open
> it with the proper prefix). If I use the link
> <a href="http://mydomain.com:8080/test/my.jsp"> instead, it is crawled
> fine. Is there a configuration file or some other setting that fixes
> this? I have seen this question in the mail archive, but it was not
> answered.
>
> Denis Pimenov
Denis Pimenov
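
To illustrate why the +^.* rule mentioned in this message is unlikely to
be what drops the links, here is a simplified sketch of a prefix-style
regex URL filter: each rule is '+' (accept) or '-' (reject) followed by
a regular expression, and the first rule that matches decides. It is
written from the documented behaviour of crawl-urlfilter.txt and is not
Nutch's own filter code; UrlFilterSketch and Rule are hypothetical
names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class UrlFilterSketch {
    // One filter line: a sign ('+' accept, '-' reject) plus a regex.
    static class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }

        static Rule parse(String line) {
            return new Rule(line.charAt(0) == '+',
                            Pattern.compile(line.substring(1)));
        }
    }

    // The first matching rule decides; a URL matching no rule is rejected.
    static boolean accepts(List<Rule> rules, String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Rule> rules = Arrays.asList(Rule.parse("+^.*"));
        System.out.println(accepts(rules, "http://mydomain.com:8080/test/my.jsp"));
        System.out.println(accepts(rules, "/test/my.jsp"));
    }
}
```

Under these assumptions +^.* accepts every string, relative or
absolute, which suggests the relative links are probably lost before
filtering, for example when outlinks are extracted from the page.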


RE: nutch crawls only relative links

Alan Tanaman
I have not looked closely at the source code, but I assume this comes
down to how the getOutlinks method of the DOMContentUtils class in the
parse-html plugin handles such links.

This method extracts outlinks from the DOM tree created from the HTML page.
These are then inserted into the crawldb for subsequent fetching.

Unless someone else has a more specific pointer, I suggest you debug
that method to see what it does with such anchors, that is, what
outlink (if any) it finally produces for each of them.
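
As a reference point for that debugging, here is a minimal stand-alone
sketch of the general technique described above: walk a DOM tree,
collect the href of each anchor, and resolve it against the page URL.
It is not the DOMContentUtils code. It uses the JDK's XML parser, so it
only handles well-formed markup, and the class name OutlinkSketch, the
sample page, and the base URL are all assumptions.

```java
import java.io.StringReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class OutlinkSketch {
    // Returns the href of every <a> element, resolved against the page URL.
    static List<String> extractOutlinks(String xhtml, URI base) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse page", e);
        }
        NodeList anchors = doc.getElementsByTagName("a");
        List<String> outlinks = new ArrayList<>();
        for (int i = 0; i < anchors.getLength(); i++) {
            String href = ((Element) anchors.item(i)).getAttribute("href");
            if (href.isEmpty()) {
                continue; // anchor without a target, e.g. <a name="top">
            }
            outlinks.add(base.resolve(href).toString());
        }
        return outlinks;
    }

    public static void main(String[] args) {
        // Assumed sample page and start URL, modelled on the question.
        String page = "<html><body>"
                + "<a href=\"/test/my.jsp\">relative</a>"
                + "<a href=\"http://mydomain.com:8080/test/my.jsp\">absolute</a>"
                + "<a name=\"top\">no href</a>"
                + "</body></html>";
        URI base = URI.create("http://mydomain.com:8080/index.html");
        for (String link : extractOutlinks(page, base)) {
            System.out.println(link);
        }
    }
}
```

If the real method is given the correct base URL, a breakpoint at the
equivalent of the resolve step should show the relative anchor turning
into a full URL. If it stays relative or is skipped there, that step is
the place to investigate.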

Best regards,
Alan
_________________________
Alan Tanaman
iDNA Solutions
Tel: +44 (20) 7257 6125
Mobile: +44 (7796) 932 362
http://blog.idna-solutions.com

-----Original Message-----
From: Denis Pimenov [mailto:[hidden email]]
Sent: 24 January 2007 15:36
To: [hidden email]
Subject: Re: nutch crawls only relative links
