fetch an ammeded url

Edward Quick
Hi

Please can someone point me in the right direction? I have a problem when crawling our intranet: many of the pages return status code 500, as illustrated in the headers below, which (correctly, I agree) gives httpclient the impression that the GET failed. However, the server actually redirects the GET by appending "?OpenDocument" to the end of the URL initially requested.

I don't think there's a way around this in the configuration, so I looked at Fetcher.java and tried to get it to refetch the URL with "?OpenDocument" appended, but my code didn't work. I can't really figure out how it works! Duh! Could someone please tell me how to get Nutch to refetch the amended URL if httpclient gets a 500 back?

Thanks,

Ed.

http://planetba.baplc.com/general/aptrix/aptprop.nsf/Content/Europe+%26+Africa+Home%5CLibrary%5C500+EA+LocCodes
GET /general/aptrix/aptprop.nsf/Content/Europe+%26+Africa+Home%5CLibrary%5C500+EA+LocCodes HTTP/1.1
Host: planetba.baplc.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.1) Gecko/2008070208 Firefox/3.0.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Cookie: ObSSOCookie=DdKzZ2Ebcglw9MjchanSFA%2FKN0agvrJTAe6PEGDHOXTeEgfmCrvqYCxVBY0qwU24Xb2T6MV3%2BUwrIfNhKVQA97J54%2Fd2%2BjetZjNoC98N4638eJpf3ZDyE50llsTdOAADaNn%2BjqVfeFrvDjJ2agM1Pxo1Y7DGR0yME1P0%2FHcd6XgFaHwEq9CyUvPq5k6mKMr7Vy4oiZS75RRPAJwNTOxoj7cLuwHX%2Fugj2GJ%2F8Jdynj6Ov1rxgeCWqGdm1ltqEma1TkAbKayt8RtilHwZxRmYDRc3tnGlaqauVUZDNVNE3B3L3bQDyfaFWaDHuX3r67CP
HTTP/1.x 500 Internal Server Error
Server: Lotus-Domino
Date: Tue, 02 Sep 2008 21:35:52 GMT
Connection: close
Expires: Tue, 01 Jan 1980 06:00:00 GMT
Content-Type: text/html; charset=US-ASCII
Content-Length: 661
Cache-Control: no-cache


----------------------------------------------------------
http://planetba.baplc.com/general/aptrix/aptprop.nsf/Content/Europe+%26+Africa+Home%5CLibrary%5C500+EA+LocCodes?OpenDocument
GET /general/aptrix/aptprop.nsf/Content/Europe+%26+Africa+Home%5CLibrary%5C500+EA+LocCodes?OpenDocument HTTP/1.1
Host: planetba.baplc.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.1) Gecko/2008070208 Firefox/3.0.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Cookie: ObSSOCookie=DdKzZ2Ebcglw9MjchanSFA%2FKN0agvrJTAe6PEGDHOXTeEgfmCrvqYCxVBY0qwU24Xb2T6MV3%2BUwrIfNhKVQA97J54%2Fd2%2BjetZjNoC98N4638eJpf3ZDyE50llsTdOAADaNn%2BjqVfeFrvDjJ2agM1Pxo1Y7DGR0yME1P0%2FHcd6XgFaHwEq9CyUvPq5k6mKMr7Vy4oiZS75RRPAJwNTOxoj7cLuwHX%2Fugj2GJ%2F8Jdynj6Ov1rxgeCWqGdm1ltqEma1TkAbKayt8RtilHwZxRmYDRc3tnGlaqauVUZDNVNE3B3L3bQDyfaFWaDHuX3r67CP
HTTP/1.x 200 OK
Server: Lotus-Domino
Date: Tue, 02 Sep 2008 21:35:52 GMT
Last-Modified: Tue, 02 Sep 2008 21:35:50 GMT
Expires: Tue, 01 Jan 1980 06:00:00 GMT
Content-Type: text/html; charset=ISO-8859-1
Content-Length: 104168
Cache-Control: no-cache
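
The two exchanges above suggest a simple rule: if a GET comes back with 500 and the URL does not already end in "?OpenDocument", try once more with that suffix appended. Below is a minimal sketch of that rule on its own, outside Nutch. The class name OpenDocumentRetry, the Result holder, the stubbed fetch function and the intranet.example host are all made up for illustration; nothing here is Nutch or httpclient API.

```java
import java.util.function.ToIntFunction;

// Hypothetical helper, not part of Nutch: retries a fetch once with
// "?OpenDocument" appended when the first attempt returns HTTP 500.
final class OpenDocumentRetry {

    static final String SUFFIX = "?OpenDocument";

    /** Holds the URL that was finally requested and the status it returned. */
    static final class Result {
        final String url;
        final int status;

        Result(String url, int status) {
            this.url = url;
            this.status = status;
        }
    }

    /**
     * Calls fetchStatus for url; on a 500, retries exactly once with the
     * suffix appended. fetchStatus is a stand-in for the real HTTP client.
     */
    static Result fetchWithRetry(String url, ToIntFunction<String> fetchStatus) {
        int status = fetchStatus.applyAsInt(url);
        if (status == 500 && !url.endsWith(SUFFIX)) {
            String amended = url + SUFFIX;
            return new Result(amended, fetchStatus.applyAsInt(amended));
        }
        return new Result(url, status);
    }

    public static void main(String[] args) {
        // Stub that imitates the server behaviour seen in the headers:
        // 500 without the suffix, 200 with it.
        ToIntFunction<String> stub = u -> u.endsWith(SUFFIX) ? 200 : 500;
        Result r = fetchWithRetry("http://intranet.example/db.nsf/Content/Page", stub);
        System.out.println(r.status + " " + r.url);
    }
}
```

Passing the fetch step in as a function keeps the retry rule testable without a network connection; in a real crawler the same decision would have to be made wherever the protocol status is inspected.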

RE: fetch an ammeded url

Edward Quick
First of all, sorry about the spelling mistake in the subject heading (it should read "amended").

I've got a bit further with my problem now. Here's some code I added to Fetcher.java to handle the 500 error returned by the Domino server:

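case ProtocolStatus.EXCEPTION:

                // The Domino server answers 500 when "?OpenDocument" is missing,
                // so treat the first failure as a redirect to the amended URL.
                newUrl = url.toString();
                if (!newUrl.endsWith("OpenDocument")) {

                        // Record the original URL as permanently moved
                        // (ProtocolStatus.MOVED is the constant for code 12).
                        status.setCode(ProtocolStatus.MOVED);
                        output(url, datum, content, status, CrawlDatum.STATUS_FETCH_REDIR_PERM);

                        // Switch to the amended URL and flag the fetch loop to follow it.
                        newUrl = newUrl + "?OpenDocument";
                        url = new Text(newUrl);
                        redirecting = true;
                        redirectCount++;
                        output(url, new CrawlDatum(), null, null, CrawlDatum.STATUS_DB_UNFETCHED);

                        break;
                } else {
                        // Already amended and still failing: log the error.
                        // No break on this path, so execution falls through
                        // to the next case label.
                        logError(url, status.getMessage());
                }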
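
One thing to watch in the snippet above: the endsWith("OpenDocument") test also matches nothing when the URL already carries a different query string or a fragment, so such a URL would end up with a second "?" or with the suffix after the "#". Whether URLs like that occur on this intranet is an assumption. A small sketch of a more defensive amend step follows; the class DominoUrls, its amend method and the host "h" are hypothetical names for illustration, not Nutch code.

```java
// Hypothetical helper, not part of Nutch: builds the amended URL while
// leaving URLs that already have a query string untouched.
final class DominoUrls {

    static final String COMMAND = "?OpenDocument";

    static String amend(String url) {
        // Split off any fragment so the command is inserted before the '#'.
        String fragment = "";
        int hash = url.indexOf('#');
        if (hash >= 0) {
            fragment = url.substring(hash);
            url = url.substring(0, hash);
        }
        // A '?' means a query or command is already present: leave it alone.
        if (url.indexOf('?') >= 0) {
            return url + fragment;
        }
        return url + COMMAND + fragment;
    }

    public static void main(String[] args) {
        System.out.println(amend("http://h/db.nsf/Content/Page"));
        System.out.println(amend("http://h/db.nsf/Content/Page?OpenView"));
    }
}
```

With a helper like this, the caller can compare the amended URL with the original and only treat the failure as a redirect when the two differ.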


I can see Nutch retrying the amended URLs, but when the crawl finishes, I can't find these pages in my search results. Any ideas?

Ed.


From: [hidden email]
To: [hidden email]
Subject: fetch an ammeded url
Date: Wed, 3 Sep 2008 19:43:39 +0000
