Map reducer filtering too many sites during generation in Nutch 2.4


Makkara Mestari


Hello
 
I have injected a seed list of about 1300 domains.
 
The first two fetches work nicely, but after that the crawler only selects URLs from a few domains, leaving all other URLs permanently in status 1 (unfetched); these number in the tens of thousands. Currently the generator produces the same 4 URLs every time, and those pages are unreachable.
 
I'm not sure, but I believe this is the fault of the reducer. Here is a sample of the output from the generation phase with -topN 50000:
 

2019-11-13 13:22:29,186 INFO  mapreduce.Job - Job job_local1940214525_0001 completed successfully
2019-11-13 13:22:29,210 INFO  mapreduce.Job - Counters: 34
        File System Counters
                FILE: Number of bytes read=1313864
                FILE: Number of bytes written=1904695
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
        Map-Reduce Framework
                Map input records=22048
                Map output records=4
                Map output bytes=584
                Map output materialized bytes=599
                Input split bytes=953
                Combine input records=0
                Combine output records=0
                Reduce input groups=4
                Reduce shuffle bytes=599
                Reduce input records=4
                Reduce output records=4
                Spilled Records=8
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=22
                CPU time spent (ms)=0
                Physical memory (bytes) snapshot=0
                Virtual memory (bytes) snapshot=0
                Total committed heap usage (bytes)=902823936
        Generator
                GENERATE_MARK=4
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=0
        File Output Format Counters
                Bytes Written=0
2019-11-13 13:22:29,238 INFO  crawl.GeneratorJob - GeneratorJob: finished at 2019-11-13 13:22:29, time elapsed: 00:00:04
2019-11-13 13:22:29,238 INFO  crawl.GeneratorJob - GeneratorJob: generated batch id: 1573651344-1856402192 containing 4 URLs
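
For reference, here is how I understand the top-N selection step, as a minimal plain-Java sketch (hypothetical Candidate type; not the actual Nutch code). With -topN 50000 the cap itself should not be the limiting factor, since the limit only applies to whatever survives the map phase:

import java.util.Comparator;
import java.util.List;

// Hypothetical, simplified illustration of top-N selection by score.
// NOT the actual Nutch 2.4 generator code; it only shows that a large -topN
// cannot help when only a handful of records survive the map phase.
public class TopNSketch {

    // Hypothetical candidate: a URL plus a generator score.
    record Candidate(String url, float score) {}

    static List<Candidate> selectTopN(List<Candidate> candidates, int topN) {
        return candidates.stream()
                .sorted(Comparator.comparingDouble(Candidate::score).reversed())
                .limit(topN)   // with only 4 candidates, limit(50000) changes nothing
                .toList();
    }

    public static void main(String[] args) {
        List<Candidate> mapOutput = List.of(
                new Candidate("http://example.com/a", 1.2f),
                new Candidate("http://example.com/b", 0.8f),
                new Candidate("http://example.com/c", 0.5f),
                new Candidate("http://example.com/d", 0.1f));
        // -topN 50000, but only 4 records survived the map phase:
        System.out.println(selectTopN(mapOutput, 50000).size());   // prints 4
    }
}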
 
If I reset the crawldb and inject only one of the domains, I can crawl it completely fine. The problem of never-fetched pages only arises when I work with a moderate number of domains at a time (1300 in this case).
 
Is this a known problem of Nutch 2.4, or have I just misconfigured something?
 
-Makkara

Re: Map reducer filtering too many sites during generation in Nutch 2.4

Sebastian Nagel-2
Hi Makkara,

> but I believe that this is the fault of the reducer
>                 Map input records=22048
>                 Map output records=4

The items are already skipped in the mapper (22048 map input records, but only 4 map output records), not in the reducer.
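
To illustrate what kind of checks drop entries at that stage, here is a simplified sketch (hypothetical PageEntry type; not the actual Nutch 2.4 GeneratorMapper): a generate mapper typically skips entries that already carry a generate mark, fail the URL filters, or are not yet due for fetch.

import java.time.Instant;

// Hypothetical, simplified illustration of why a generate mapper can emit far
// fewer records than it reads. NOT the actual Nutch 2.4 GeneratorMapper.
public class GenerateFilterSketch {

    // Hypothetical page entry with only the fields relevant to the decision.
    record PageEntry(String url, Instant fetchTime,
                     boolean hasGenerateMark, boolean passesUrlFilters) {}

    // True only if the entry should be emitted to the reduce phase.
    static boolean shouldGenerate(PageEntry page, Instant now) {
        if (page.hasGenerateMark()) {
            return false;   // already selected for a pending batch
        }
        if (!page.passesUrlFilters()) {
            return false;   // rejected by URL filters / normalizers
        }
        if (page.fetchTime().isAfter(now)) {
            return false;   // not yet due for (re-)fetch according to the schedule
        }
        return true;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        PageEntry due = new PageEntry("http://example.com/", now.minusSeconds(3600), false, true);
        PageEntry notDue = new PageEntry("http://example.org/", now.plusSeconds(86400), false, true);
        System.out.println(shouldGenerate(due, now));     // true
        System.out.println(shouldGenerate(notDue, now));  // false
    }
}

Any of these conditions, applied across the whole web table, could explain 22048 map input records shrinking to only 4 map output records.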

> Is this a known problem of Nutch 2.4, or have I just misconfigured
> something?

It could be the configuration, or a bug in the storage layer causing not all items of the web table to be sent to the mapper.

Please also note that we expect 2.4 to be the last release of the 2.X series. We've decided to freeze development on the 2.X branch for now, as no committer is actively working on it. Nutch 1.x is actively maintained.

Best,
Sebastian
