inject deletes urls from crawldb


inject deletes urls from crawldb

Michael Coffey
Perhaps my strangest question yet!
Why does Inject delete URLs from the crawldb and how can I prevent it?
I was trying to add 2 new sites to an existing crawldb. According to readdb stats, about 10% of my URLs disappeared in the process.

(before injecting)
17/09/27 19:22:33 INFO crawl.CrawlDbReader: TOTAL urls: 24849
17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 1 (db_unfetched):    20047
17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 2 (db_fetched):      3465
17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 3 (db_gone):         402
17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 4 (db_redir_temp):   779
17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 5 (db_redir_perm):   91
17/09/27 19:22:33 INFO crawl.CrawlDbReader: status 7 (db_duplicate):    65

(after injecting)
17/09/27 19:26:15 INFO crawl.CrawlDbReader: TOTAL urls: 22405
17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 1 (db_unfetched):    19014
17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 2 (db_fetched):      3187
17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 3 (db_gone):         36
17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 4 (db_redir_temp):   28
17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 5 (db_redir_perm):   91
17/09/27 19:26:15 INFO crawl.CrawlDbReader: status 7 (db_duplicate):    49
My command line looks like this:
$NUTCH_HOME/runtime/deploy/bin/nutch inject -D db.score.injected=1 -D db.fetch.interval.default=3600 /crawls/$crawlspace/data/crawldb /crawls/$crawlspace/seeds_nbcnews.txt
Does it apply urlfilters as it injects?
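For context on how filtering at inject time could remove records: Nutch's regex URL filter (configured in conf/regex-urlfilter.txt) evaluates rules top to bottom, and the first matching rule decides whether a URL is kept ('+') or dropped ('-'); a URL matching no rule is also dropped. A minimal sketch of that first-match semantics, with hypothetical rules and URLs that are illustrative only, not taken from the poster's configuration:

```python
import re

# Illustrative rules in the spirit of conf/regex-urlfilter.txt
# (these patterns are hypothetical, not from the poster's config).
# The FIRST matching rule decides; '+' keeps the URL, '-' drops it.
RULES = [
    ("-", re.compile(r"\.(gif|jpg|png|css|js)$")),
    ("+", re.compile(r"^https?://([a-z0-9-]+\.)*nbcnews\.com/")),
    ("-", re.compile(r".")),  # catch-all: reject everything else
]

def accepts(url: str) -> bool:
    """Return True if the first matching rule is a '+' rule."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched: treated as rejected

urls = [
    "https://www.nbcnews.com/politics",
    "https://example.org/page",  # hits the catch-all '-' rule
]
kept = [u for u in urls if accepts(u)]
print(kept)  # → ['https://www.nbcnews.com/politics']
```

With a rule set like this, any URL already in the crawldb that falls through to the catch-all '-' rule would be silently discarded if the filter were applied to existing records.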

RE: inject deletes urls from crawldb

Markus Jelsma-2
filters and/or normalizers come to mind!

 
 
-----Original message-----

> From: Michael Coffey <[hidden email]>
> Sent: Thursday 28th September 2017 4:40
> To: User <[hidden email]>
> Subject: inject deletes urls from crawldb
>
> [...]

Re: inject deletes urls from crawldb

Michael Coffey
If the Inject command does filtering, then the documentation should say so. The page https://wiki.apache.org/nutch/bin/nutch%20inject does not mention any filtering or normalization. I find it very counter-intuitive that an injection operation would delete existing data.

Should I edit that page? Can I?


From: Markus Jelsma <[hidden email]>
To: "[hidden email]" <[hidden email]>; User <[hidden email]>
Sent: Thursday, September 28, 2017 2:06 AM
Subject: RE: inject deletes urls from crawldb

[...]

Re: inject deletes urls from crawldb

Sebastian Nagel
Hi Michael,

That's actually due to a bug introduced in Nutch 1.12 and already fixed for Nutch 1.14; see
  https://issues.apache.org/jira/browse/NUTCH-2335

Thanks,
Sebastian
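Until the 1.14 fix is in place, one defensive habit is to snapshot the readdb -stats counts before and after every inject and diff them, so a silent drop like this is caught immediately. A minimal sketch of the comparison (the file names are made up, and the sample lines are truncated copies of the stats quoted earlier in this thread; in practice you would capture the real output of `bin/nutch readdb <crawldb> -stats`):

```shell
# Truncated 'readdb -stats' counts, copied from this thread.
cat > before.log <<'EOF'
TOTAL urls: 24849
status 1 (db_unfetched):    20047
status 3 (db_gone): 402
EOF
cat > after.log <<'EOF'
TOTAL urls: 22405
status 1 (db_unfetched):    19014
status 3 (db_gone): 36
EOF
# Split each line on ':' plus spaces, remember the 'before' counts,
# then print the per-status delta while reading the 'after' file.
awk -F': *' 'NR==FNR { before[$1] = $2; next }
             { printf "%-28s %d\n", $1, $2 - before[$1] }' before.log after.log
# prints deltas: TOTAL urls -2444, db_unfetched -1033, db_gone -366
```

A negative TOTAL delta right after an inject is the red flag: inject should only ever add or merge URLs, never shrink the database.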

On 09/28/2017 07:26 PM, Michael Coffey wrote:
> [...]