Generator taking time

Generator taking time

James Ford
Hello,

I am having problems with the Generator step of my crawls: it takes a lot of time compared to fetching and indexing. Right now the generate step takes about 50 minutes, while fetching, parsing, and indexing together take only about 5-10 minutes. It seems like the RegexURLNormalizer is taking up the time:

2012-03-22 11:13:28,277 INFO  regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
2012-03-22 11:16:00,734 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2012-03-22 11:16:00,734 INFO  crawl.AbstractFetchSchedule - defaultInterval=2592000
2012-03-22 11:16:00,734 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000

Crawldb dump:

2012-03-21 14:32:10,310 INFO  crawl.CrawlDbReader - Statistics for CrawlDb: crawldb/
2012-03-21 14:32:10,310 INFO  crawl.CrawlDbReader - TOTAL urls: 7819485
2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - retry 0:    7811052
2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - retry 1:    2994
2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - retry 2:    1214
2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - retry 3:    1125
2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - retry 4:    1124
2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - retry 5:    1303
2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - retry 6:    673
2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - min score:  0.0
2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - avg score:  0.0015287232
2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - max score:  2.0
2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - status 1 (db_unfetched):    6946135
2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - status 2 (db_fetched):      795070
2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - status 3 (db_gone): 34358
2012-03-21 14:32:10,312 INFO  crawl.CrawlDbReader - status 4 (db_redir_temp):   21861
2012-03-21 14:32:10,312 INFO  crawl.CrawlDbReader - status 5 (db_redir_perm):   22044
2012-03-21 14:32:10,312 INFO  crawl.CrawlDbReader - status 6 (db_notmodified):  17
2012-03-21 14:32:10,312 INFO  crawl.CrawlDbReader - CrawlDb statistics: done

Does anyone have a clue how to fix this?

Re: Generator taking time

Markus Jelsma-2
If the state of your CrawlDB is already normalized, then do not use a
normalizer unless you really have to. The same is true for filtering in this
step.
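One way to keep the CrawlDb in that state is to normalize and filter at updatedb time instead, so generate can skip both. A minimal sketch, assuming the crawldb/ path from the first post, an illustrative segments/ directory, and that your Nutch version's updatedb accepts the -normalize and -filter switches (run bin/nutch updatedb with no arguments to confirm the exact usage):

bin/nutch updatedb crawldb -dir segments -normalize -filter

With the database kept clean on every update, the generate step no longer needs to re-normalize each entry.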

--
Markus Jelsma - CTO - Openindex

Re: Generator taking time

James Ford
Thanks for the answer, Markus,

But I don't think I follow you; I am new to Nutch. How can I make Nutch use the normalizer only when I have to? I tried removing normalizers from the normalizer order in the config, but nothing happened.

Re: Generator taking time

Markus Jelsma-2
bin/nutch generate
Usage: Generator <crawldb> <segments_dir> [-force] [-topN N] [-numFetchers numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num]

Use the -noNorm option, and likely the -noFilter option as well. But again, only do this
if you are sure the state of the CrawlDB is already normalized and properly
filtered.
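For example, against the layout from the first post (the crawldb/ path is taken from there; the segments directory name is illustrative), a generate run that skips both steps would look like:

bin/nutch generate crawldb segments -noNorm -noFilter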

--
Markus Jelsma - CTO - Openindex

Re: Generator taking time

Greg Fields
I have the same problem. I have ~5000 URLs in my seed list and fetch 15000 pages in each iteration. The fetching/indexing time is fast, but the time spent in the RegexURLNormalizer doubles with each iteration. When should I use the -noFilter and -noNorm flags? Does the normalizer go through the whole unfetched list every time?
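One way to watch that growth is Nutch's readdb tool, which produced the CrawlDb statistics quoted in the first post. Re-running it after each iteration shows how the CrawlDb, which is the input the generate step has to scan, keeps growing (the crawldb path here is illustrative):

bin/nutch readdb crawldb -stats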