FW: invalid urls

FW: invalid urls

Edward Quick

The headers below might explain my problem a bit better. Nutch fetches the URL and bails out when it gets the 500 response. Firefox, however, appears to follow a redirect that adds ?OpenDocument to the end of the URL, and then gets a 200. Can I configure Nutch to get round this?
 
http://planetba.baplc.com/general/aptrix/aptprop.nsf/Content/Europe+%26+Africa+Home%5CLibrary%5C500+EA+LocCodes

GET /general/aptrix/aptprop.nsf/Content/Europe+%26+Africa+Home%5CLibrary%5C500+EA+LocCodes HTTP/1.1
Host: planetba.baplc.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.1) Gecko/2008070208 Firefox/3.0.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Cookie: ObSSOCookie=DdKzZ2Ebcglw9MjchanSFA%2FKN0agvrJTAe6PEGDHOXTeEgfmCrvqYCxVBY0qwU24Xb2T6MV3%2BUwrIfNhKVQA97J54%2Fd2%2BjetZjNoC98N4638eJpf3ZDyE50llsTdOAADaNn%2BjqVfeFrvDjJ2agM1Pxo1Y7DGR0yME1P0%2FHcd6XgFaHwEq9CyUvPq5k6mKMr7Vy4oiZS75RRPAJwNTOxoj7cLuwHX%2Fugj2GJ%2F8Jdynj6Ov1rxgeCWqGdm1ltqEma1TkAbKayt8RtilHwZxRmYDRc3tnGlaqauVUZDNVNE3B3L3bQDyfaFWaDHuX3r67CP

HTTP/1.x 500 Internal Server Error
Server: Lotus-Domino
Date: Tue, 02 Sep 2008 21:35:52 GMT
Connection: close
Expires: Tue, 01 Jan 1980 06:00:00 GMT
Content-Type: text/html; charset=US-ASCII
Content-Length: 661
Cache-Control: no-cache
----------------------------------------------------------
http://planetba.baplc.com/general/aptrix/aptprop.nsf/Content/Europe+%26+Africa+Home%5CLibrary%5C500+EA+LocCodes?OpenDocument

GET /general/aptrix/aptprop.nsf/Content/Europe+%26+Africa+Home%5CLibrary%5C500+EA+LocCodes?OpenDocument HTTP/1.1
Host: planetba.baplc.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.1) Gecko/2008070208 Firefox/3.0.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Cookie: ObSSOCookie=DdKzZ2Ebcglw9MjchanSFA%2FKN0agvrJTAe6PEGDHOXTeEgfmCrvqYCxVBY0qwU24Xb2T6MV3%2BUwrIfNhKVQA97J54%2Fd2%2BjetZjNoC98N4638eJpf3ZDyE50llsTdOAADaNn%2BjqVfeFrvDjJ2agM1Pxo1Y7DGR0yME1P0%2FHcd6XgFaHwEq9CyUvPq5k6mKMr7Vy4oiZS75RRPAJwNTOxoj7cLuwHX%2Fugj2GJ%2F8Jdynj6Ov1rxgeCWqGdm1ltqEma1TkAbKayt8RtilHwZxRmYDRc3tnGlaqauVUZDNVNE3B3L3bQDyfaFWaDHuX3r67CP

HTTP/1.x 200 OK
Server: Lotus-Domino
Date: Tue, 02 Sep 2008 21:35:52 GMT
Last-Modified: Tue, 02 Sep 2008 21:35:50 GMT
Expires: Tue, 01 Jan 1980 06:00:00 GMT
Content-Type: text/html; charset=ISO-8859-1
Content-Length: 104168
Cache-Control: no-cache
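
The behaviour in the two traces above (a 500 for the bare URL, a 200 once ?OpenDocument is appended) could be imitated by a fetch wrapper that retries once with the suffix added. The following is only a rough Python sketch of that idea, not Nutch code and not an existing Nutch setting. The names with_open_document and fetch_with_fallback are hypothetical, and the fetch callable is passed in so the logic can be tried without touching the network.

```python
from typing import Callable, Tuple
from urllib.parse import urlsplit, urlunsplit


def with_open_document(url: str) -> str:
    """Return url with ?OpenDocument appended.

    URLs that already carry a query string are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.query:
        return url
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, "OpenDocument", parts.fragment)
    )


def fetch_with_fallback(
    url: str, fetch: Callable[[str], Tuple[int, bytes]]
) -> Tuple[str, int, bytes]:
    """Fetch url; on a 500, retry once with ?OpenDocument appended.

    fetch is a hypothetical callable returning (status_code, body).
    Returns (url_actually_used, status_code, body).
    """
    status, body = fetch(url)
    if status == 500:
        retry_url = with_open_document(url)
        if retry_url != url:  # only retry when the URL actually changed
            status, body = fetch(retry_url)
            return retry_url, status, body
    return url, status, body
```

Whether a retry like this is appropriate depends on the server; it assumes that every 500 on a query-less URL is the same Domino quirk seen here, which may not hold for other pages.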



Hi,

When I run a crawl on our intranet (which runs on a Lotus Notes Domino server, hence the strange URLs), I get back a few error messages, most of them in the format below.

fetch of http://planetba.baplc.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/VideoJavaScript/$FILE/)){this.addVariable( failed with: java.lang.IllegalArgumentException: Invalid uri 'http://planetba.baplc.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/VideoJavaScript/$FILE/)){this.addVariable(': escaped absolute path not valid

fetch of http://planetba.baplc.com/general/aptrix/apteng.nsf/Content/Engineering+Home%5CPeople+%26+Training%5CAircraft+Maintenance+Training+%E2%80%93+A320+Single+Aisle+Family failed with: Http code=500, url=http://planetba.baplc.com/general/aptrix/apteng.nsf/Content/Engineering+Home%5CPeople+%26+Training%5CAircraft+Maintenance+Training+%E2%80%93+A320+Single+Aisle+Family

Is there anything I can configure in Nutch to handle these without filtering them out? They do appear to be legitimate pages.

Thanks for any help.

Rgds,
Ed.
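
The two errors quoted above look different in kind: the first URL ends in what looks like a scrap of JavaScript, while the second contains only ordinary and percent-escaped characters. One way to tell such cases apart is to check a URL against the character set that RFC 3986 allows. The snippet below is a small Python sketch of that check; it is not the validation that Nutch or its HTTP client performs, and the function names are made up for illustration.

```python
import re

# Characters RFC 3986 permits in a URI (unreserved plus reserved).
_ALLOWED_CHAR = re.compile(r"[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=]")
# A well-formed percent escape such as %5C or %E2.
_PERCENT_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")


def illegal_chars(url: str) -> list:
    """Return the sorted, de-duplicated characters RFC 3986 does not allow."""
    without_escapes = _PERCENT_ESCAPE.sub("", url)
    leftover = _ALLOWED_CHAR.sub("", without_escapes)
    return sorted(set(leftover))


def looks_like_valid_uri(url: str) -> bool:
    """True when url contains only allowed characters and percent escapes."""
    return not illegal_chars(url)
```

Applied to the first failing URL, this reports the `{` character, which suggests that URL is not a real page; the second URL passes the check, so its failure is the HTTP 500 from the server rather than a malformed address.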




Re: FW: invalid urls

郑世强
First of all, you need to use the protocol-httpclient plugin.


RE: invalid urls

Edward Quick

I am already using the protocol-httpclient plugin, though.

> Subject: Re: FW: invalid urls
> From: [hidden email]
> To: [hidden email]
> Date: Wed, 3 Sep 2008 09:56:19 +0800
>
> First of all, you need to use the protocol-httpclient plugin.
