Relative urls - outlinks

webdev1977
Is there any way to keep Nutch from generating outlinks for any RELATIVE URLs?  I basically don't want to use ANY relative URLs that I find.

Then the next question is: how do I get them out of my crawldb? :-)

RE: Relative urls - outlinks

Markus Jelsma-2
No, relative URLs are resolved in both parser plugins. You can try to disable that manually. There's no way to remove them from the CrawlDB except with some clever filtering, because they're absolute now and look like any other URL.
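
To illustrate why the resolved links can no longer be told apart, here is a minimal sketch in plain Java. It is not Nutch's parser code; it only assumes that a relative href is resolved against the URL of the page it was found on, and the class name, method name, and page URL are made up for the example.

```java
import java.net.URI;

public class RelativeLinkDemo {
    // Resolve an href found on a page against that page's URL.
    // Hypothetical helper for illustration, not part of Nutch.
    public static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "http://myhost:81/site1/list.php?page=1";

        // A relative link and an absolute link to the same target.
        String fromRelative = resolve(page, "view.php?id=1234");
        String fromAbsolute = resolve(page, "http://myhost:81/site1/view.php?id=1234");

        // Both lines print http://myhost:81/site1/view.php?id=1234
        System.out.println(fromRelative);
        System.out.println(fromAbsolute);
    }
}
```

Once both forms end up as the same string, nothing in the stored URL records that one of them started out relative, so any cleanup has to filter on the URL text itself.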

 
 
-----Original message-----

> From:webdev1977 <[hidden email]>
> Sent: Tue 18-Sep-2012 15:24
> To: [hidden email]
> Subject: Relative urls - outlinks
>
> Is there any way to keep Nutch from generating outlinks for any RELATIVE URLs?
> I basically don't want to use ANY relative URLs that I find.
>
> Then the next question is how do I get them out of my crawldb :-)

RE: Relative urls - outlinks

webdev1977
NOOOooooo!!!  Just kidding! :-)  

So maybe you can clear something up for me.  Suppose that in the future, while building a new crawldb, I only want to accept URLs like the following:

http://myhost:81/site1/test.php?id=1234
http://myhost:81/site1/list.php?page=1234&count=21
http://myhost:81/site1/view.php?id=1234
http://myhost:81/site2/test2.php?id=12233
http://myhost:81/site2/list.php?page=25&count=12344

file:////sharedrive1/share1/

How would the regex-urlfilter rules look for the PHP pages?

+^http://myhost:81/site1/test.php\?.*    ???
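
One way to try out candidate rules before touching the crawl is a small stand-alone Java check. The sketch below is not Nutch code. It assumes the filter treats each rule as a Java regex matched anywhere in the URL, takes the first matching rule as the verdict, and rejects URLs that match no rule. The rule set is only a guess at what the list above calls for, and the class and method names are made up. Note the escaped dot in `\.php`: an unescaped `.` matches any character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class UrlFilterSketch {
    // One rule: a regex plus whether a match means accept ('+') or reject ('-').
    static final class Rule {
        final Pattern pattern;
        final boolean accept;
        Rule(String regex, boolean accept) {
            this.pattern = Pattern.compile(regex);
            this.accept = accept;
        }
    }

    // Assumed equivalent regex-urlfilter lines, in order:
    //   +^http://myhost:81/site1/(test|list|view)\.php\?
    //   +^http://myhost:81/site2/(test2|list)\.php\?
    //   +^file:////sharedrive1/share1/
    //   -.
    static final List<Rule> RULES = new ArrayList<>();
    static {
        RULES.add(new Rule("^http://myhost:81/site1/(test|list|view)\\.php\\?", true));
        RULES.add(new Rule("^http://myhost:81/site2/(test2|list)\\.php\\?", true));
        RULES.add(new Rule("^file:////sharedrive1/share1/", true));
        RULES.add(new Rule(".", false)); // reject everything else
    }

    // First matching rule decides; no match at all means reject (assumption).
    public static boolean accepts(String url) {
        for (Rule rule : RULES) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://myhost:81/site1/test.php?id=1234",
            "http://myhost:81/site2/list.php?page=25&count=12344",
            "file:////sharedrive1/share1/",
            "http://myhost:81/site3/other.php?id=1"
        };
        for (String url : samples) {
            System.out.println((accepts(url) ? "+ " : "- ") + url);
        }
    }
}
```

If the regex-urlfilter file in use still contains a stock rule that rejects URLs with query characters such as `?` and `=`, or one that rejects `file:` URLs, those rules would have to be removed or placed after the accept rules, since the first match wins under the assumption above.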