[jira] Created: (NUTCH-286) Handling common error-pages as 404

Sebastian Nagel (Jira)
Handling common error-pages as 404
----------------------------------

         Key: NUTCH-286
         URL: http://issues.apache.org/jira/browse/NUTCH-286
     Project: Nutch
        Type: Improvement

    Reporter: Stefan Neufeind


Idea: Some software packages and scripts return "HTTP 200 OK" even though the requested page could not be found. An example I just found is:
http://www.deteimmobilien.de/unternehmen/nbjmup;Uipnbt/IfsctuAefufjnnpcjmjfo/ef
That is a TYPO3 page which explains, in its standard layout and wording: "The requested page did not exist or was inaccessible."

So my idea is that somebody might create a plugin that recognizes commonly used formulations for "page does not exist" etc. and treats such a page as a 404 before it is fed into the Nutch index, even though the server responded with status 200 OK.
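
To make the idea concrete, here is a rough sketch of the phrase-matching step in plain Java. It is only an illustration under assumptions: the class name SoftNotFoundDetector, its methods, and the phrase list are hypothetical, it does not use any Nutch plugin API, and a real phrase list would need tuning per site and language.

```java
import java.util.Locale;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Nutch: flags "soft 404" pages by their text. */
final class SoftNotFoundDetector {

    // Illustrative phrases only; the first one is the TYPO3 wording quoted above.
    private static final Pattern[] PHRASES = {
        Pattern.compile("requested page did not exist"),
        Pattern.compile("page (was )?not found"),
        Pattern.compile("page does not exist")
    };

    private SoftNotFoundDetector() {}

    /** Returns true if the extracted page text contains one of the known error phrases. */
    static boolean looksLikeNotFound(String pageText) {
        if (pageText == null) {
            return false;
        }
        // Lower-case and collapse whitespace so line breaks in the page do not hide a phrase.
        String normalized = pageText.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
        for (Pattern phrase : PHRASES) {
            if (phrase.matcher(normalized).find()) {
                return true;
            }
        }
        return false;
    }

    /** Maps the server's status to the status the crawler should act on. */
    static int effectiveStatus(int httpStatus, String pageText) {
        // Only a 200 response is ever downgraded; real error codes pass through unchanged.
        return (httpStatus == 200 && looksLikeNotFound(pageText)) ? 404 : httpStatus;
    }
}
```

Calling SoftNotFoundDetector.effectiveStatus(200, text) with the TYPO3 message above would yield 404, while any page that matches none of the phrases keeps its original status. Plain phrase matching can misfire on pages that merely discuss error pages, so the list should stay conservative.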


--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


[jira] Commented: (NUTCH-286) Handling common error-pages as 404

Sebastian Nagel (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-286?page=comments#action_12414439 ]

Stefan Groschupf commented on NUTCH-286:
----------------------------------------

This is difficult to implement, since the HTTP status code is read from the response in the fetcher and set in the protocol status, whereas content analysis can only be done during parsing.
Also, such pages normally do not get a high OPIC score and should not appear in the top search results.
However, this is a misconfigured HTTP server response, so you may want to file a bug in the TYPO3 issue tracker.
Should we close this issue?



[jira] Commented: (NUTCH-286) Handling common error-pages as 404

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-286?page=comments#action_12414464 ]

Stefan Neufeind commented on NUTCH-286:
---------------------------------------

Well, we _could_ close it, though the question still remains for me. The problem, imho, is that you say it's hard to do.
For sure you could always write searches to prune those pages from the index, but I wonder whether that's a clean solution or whether it would be better to have a way of excluding certain pages (like these common error pages, even though their header is wrong). I guess it's the typical problem when crawling the web: technicians will say "that webserver/TYPO3 is wrong and has to be fixed", but management will not care, and you will have to solve the problem one way or another.



[jira] Closed: (NUTCH-286) Handling common error-pages as 404

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)
     [ http://issues.apache.org/jira/browse/NUTCH-286?page=all ]
     
Stefan Groschupf closed NUTCH-286:
----------------------------------

    Resolution: Won't Fix

I hope everybody agrees with this statement: we cannot detect HTTP response codes based on the returned HTML content.
Pruning the index is a good way to solve the problem.

