Deleting file: urls from crawldb that give 404 status

webdev1977
I am having an issue removing deleted file: urls on subsequent crawls.  A deleted url stays at status db_unfetched and never moves to the 404 (db_gone) status.  This means that I can't run solrclean to get rid of the old file: urls.

I poked around in the protocol-file code and made some changes to the ProtocolOutput class to force a 404 if a file: url has been deleted.  It didn't seem to make a difference when the url was fetched, however.

Any ideas how to get rid of deleted file: urls?
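
For illustration, the check described in the question (report a 404 when the file behind a file: url no longer exists) might look roughly like the sketch below. It is standalone Java, not the actual protocol-file plugin code: the class name FileUrlStatusSketch and the method statusFor are hypothetical, and it does not touch Nutch's ProtocolOutput class or the crawldb.

```java
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical standalone sketch; not part of Nutch or its protocol-file plugin.
public class FileUrlStatusSketch {
    public static final int OK = 200;
    public static final int NOT_FOUND = 404;

    /** Returns 404 when the file: URL points at nothing on disk, 200 otherwise. */
    public static int statusFor(URI fileUrl) {
        if (!"file".equalsIgnoreCase(fileUrl.getScheme())) {
            throw new IllegalArgumentException("not a file: URL: " + fileUrl);
        }
        // Convert the file: URI to a local path and test for existence.
        Path path = Paths.get(fileUrl);
        return Files.exists(path) ? OK : NOT_FOUND;
    }

    public static void main(String[] args) {
        // The default URL is only an example; pass a real file: URL as the first argument.
        URI url = URI.create(args.length > 0 ? args[0] : "file:///tmp/example.txt");
        System.out.println(url + " -> " + statusFor(url));
    }
}
```

A status code produced this way only helps if the later crawldb update actually records the url as gone, which is the part that does not seem to be happening here.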
RE: Deleting file: urls from crawldb that give 404 status

Markus Jelsma-2
Sounds like:
https://issues.apache.org/jira/browse/NUTCH-1245 
 
Also, with a recent Nutch you can index with the -deleteGone flag. It behaves similarly to SolrClean, but only on the records you just fetched.
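
The difference in scope can be pictured with a toy model in plain Java. This is only a sketch of the idea stated above, not Nutch code: the class DeleteGoneSketch, the Status enum, and the example urls are all made up for illustration, and real Nutch works on crawldb and segment data rather than in-memory maps.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model; none of these names exist in Nutch.
public class DeleteGoneSketch {
    public enum Status { FETCHED, UNFETCHED, GONE }

    /** Returns the URLs marked GONE in the given records, in iteration order. */
    public static List<String> goneUrls(Map<String, Status> records) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Status> e : records.entrySet()) {
            if (e.getValue() == Status.GONE) {
                out.add(e.getKey());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Everything the crawl knows about (stand-in for the whole crawldb).
        Map<String, Status> crawlDb = new LinkedHashMap<>();
        crawlDb.put("file:/docs/a.txt", Status.FETCHED);
        crawlDb.put("file:/docs/b.txt", Status.GONE); // went missing in an earlier crawl
        crawlDb.put("file:/docs/c.txt", Status.GONE); // went missing in the latest fetch

        // Only the records from the fetch that was just run.
        Map<String, Status> justFetched = new LinkedHashMap<>();
        justFetched.put("file:/docs/a.txt", Status.FETCHED);
        justFetched.put("file:/docs/c.txt", Status.GONE);

        // A SolrClean-like pass looks at every gone record in the crawldb.
        System.out.println("solrclean-like:   " + goneUrls(crawlDb));
        // A -deleteGone-like pass looks only at the records just fetched.
        System.out.println("-deleteGone-like: " + goneUrls(justFetched));
    }
}
```

In both cases a url has to reach the gone status first, so neither approach removes urls that stay at db_unfetched.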

-----Original message-----

> From:webdev1977 <[hidden email]>
> Sent: Tue 19-Jun-2012 21:40
> To: [hidden email]
> Subject: Deleting file: urls from crawldb that give 404 status
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Deleting-file-urls-from-crawldb-that-give-404-status-tp3990391.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>