CrawlDB data-loss and unable to inject 1.12 on Hadoop 2.7.3


CrawlDB data-loss and unable to inject 1.12 on Hadoop 2.7.3

Markus Jelsma-2
Hello,

This Wednesday we experienced trouble running the 1.12 injector on Hadoop 2.7.3. We operated 2.7.2 before and had no trouble running a job.

2017-01-18 15:36:53,005 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.Counter, but class was expected
        at org.apache.nutch.crawl.Injector$InjectMapper.map(Injector.java:216)
        at org.apache.nutch.crawl.Injector$InjectMapper.map(Injector.java:100)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.Counter, but class was expected
        at org.apache.nutch.crawl.Injector.inject(Injector.java:383)
        at org.apache.nutch.crawl.Injector.run(Injector.java:467)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.crawl.Injector.main(Injector.java:441)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

Our processes retried injecting for a few minutes until we manually shut them down. Meanwhile, on HDFS, our CrawlDB was gone. Thanks to snapshots and/or backups we could restore it, so enable those if you haven't done so yet.

These freak Hadoop errors can be notoriously difficult to debug, but it seems we are in luck: recompile Nutch with Hadoop 2.7.3 instead of 2.4.0. You are also in luck if your job file uses the old org.apache.hadoop.mapred.* API; only jobs using the org.apache.hadoop.mapreduce.* API seem to fail.
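
For those hitting the same thing, the rebuild is roughly the following (a sketch, assuming a Nutch 1.12 source checkout where the Hadoop artifacts in ivy/ivy.xml are still pinned to rev="2.4.0"; check your tree before running the sed, since other dependencies could share the same rev string):

  # Point the Ivy Hadoop dependencies at the cluster's version (2.4.0 -> 2.7.3)
  sed -i 's/rev="2.4.0"/rev="2.7.3"/g' ivy/ivy.xml

  # Rebuild; the job file ends up under runtime/deploy/
  ant clean runtime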

Reference issue: https://issues.apache.org/jira/browse/NUTCH-2354

Regards,
Markus

Re: CrawlDB data-loss and unable to inject 1.12 on Hadoop 2.7.3

Sebastian Nagel
Hi Markus,

after having once faced failing jobs due to dependency issues,
I started to compile the Nutch.job with the same Hadoop version
as the cluster. It's a little extra time to change the ivy.xml
and occasionally resolve a conflicting dependency, but fixing
broken data in the cluster costs you much more.
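
For reference, the Hadoop dependencies in ivy/ivy.xml look roughly like this (a sketch only: the exact module list and conf attributes differ between Nutch releases, so check your own copy rather than pasting these verbatim):

  <dependency org="org.apache.hadoop" name="hadoop-common" rev="2.7.3" conf="*->default"/>
  <dependency org="org.apache.hadoop" name="hadoop-hdfs" rev="2.7.3" conf="*->default"/>
  <dependency org="org.apache.hadoop" name="hadoop-mapreduce-client-core" rev="2.7.3" conf="*->default"/>
  <dependency org="org.apache.hadoop" name="hadoop-mapreduce-client-jobclient" rev="2.7.3" conf="*->default"/>

Setting rev to the version running on the cluster is the whole trick; a normal "ant clean runtime" then rebuilds the job file against it.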


> Reference issue: https://issues.apache.org/jira/browse/NUTCH-2354

What about the opposite, running a Nutch.job compiled with 2.7.3 on a 2.7.2 Hadoop?
Nothing against upgrading, but when in doubt it would be good to know.


Thanks,
Sebastian



RE: CrawlDB data-loss and unable to inject 1.12 on Hadoop 2.7.3

Markus Jelsma-2
In reply to this post by Markus Jelsma-2
Hello Sebastian,

I am not sure what will happen when it is compiled with 2.7.3 but run on 2.7.2. Since the other way around caused trouble (which usually doesn't happen), we could assume this might not work well either. Unfortunately I cannot test it; both our Hadoop clusters have already been upgraded.

Everyone would either have to recompile Nutch themselves or upgrade their Hadoop cluster. The latter is mostly a good thing: 2.7.2 and 2.7.3 fixed long-standing issues for Nutch.

The question is: what do we do?

Thanks,
Markus
 

RE: CrawlDB data-loss and unable to inject 1.12 on Hadoop 2.7.3

Markus Jelsma-2
In reply to this post by Markus Jelsma-2
Hmm - I may have been wrong about recompiling. For some reason, the problem persists but seems to be caused by custom patches. I confirmed that 1.12 and master both run fine on Hadoop 2.7.3, whether compiled with 2.7.3 or 2.7.2.

Regards,
Markus

 
 

Re: Speed of linkDB

Michael Coffey
In reply to this post by Sebastian Nagel
Thank you, Sebastian, that sounds like a great suggestion! You're right that 3000 is a small segment size. I am using 3000 per slave just in this still-early testing phase. I don't know the actual size of my linkdb, but my crawldb has over 48 million urls so far, of which over 1.5 million have been fetched.


I think I need the linkdb because incoming anchors are important for search-engine relevance, right?


________________________________
From: Sebastian Nagel <[hidden email]>


Hi Michael,

what is the size of your linkdb? If it's large (significantly larger than the segment) the reason is easily explained: the linkdb needs to be rewritten on every invertlinks step. That's an expensive action becoming more expensive for larger crawls. Unless you really need the linkdb to add anchor texts to your index you could:
- either limit the linkdb size by excluding internal links
- or update it less frequently (multiple segments in one turn)

A segment size of 3000 URLs seems small for a distributed crawl with a large number of different hosts or domains. You may observe similar problems updating the CrawlDb, although later because the CrawlDb is usually smaller, esp. if the linkdb includes also internal links.

Best,
Sebastian

On 04/03/2017 02:08 AM, Michael Coffey wrote:
> In my situation, I find that linkdb merge takes much more time than fetch and parse combined, even though fetch is fully polite.
>
> What is the standard advice for making linkdb-merge go faster?
>
> I call invertlinks like this:
> __bin_nutch invertlinks "$CRAWL_PATH"/linkdb "$CRAWL_PATH"/segments/$SEGMENT
>
> invertlinks  seems to call mergelinkdb automatically.
>
> I currently have about 3-6 slaves for fetching, though that will increase soon. I am currently using small segment sizes (3000 urls) but I can increase that if it would help.

>
> I have the following properties that may be relevant.
>
> <property>
>  <name>db.max.outlinks.per.page</name>
>  <value>1000</value>
> </property>
>
> <property>
>  <name>db.ignore.external.links</name>
>  <value>false</value>
> </property>
>
>
> The following props are left as default in nutch-default.xml
>
> <property>
>  <name>db.update.max.inlinks</name>
>  <value>10000</value>
> </property>
>
> <property>
>  <name>db.ignore.internal.links</name>
>  <value>false</value>
> </property>
>
> <property>
>  <name>db.ignore.external.links</name>
>  <value>false</value>
> </property>
>

RE: Speed of linkDB

Markus Jelsma-2
Hello Michael - see inline.
Markus
 
-----Original message-----
> From:Michael Coffey <[hidden email]>
> Sent: Tuesday 4th April 2017 21:32
> To: [hidden email]
> Subject: Re: Speed of linkDB
>
> Thank you, Sebastian, that sounds like a great suggestion! You're right that 3000 is a small segment size. I am using 3000 per slave just in this still-early testing phase. I don't know the actual size of my linkdb, but my crawldb has over 48 million urls so far, of which over 1.5 million have been fetched.

If I remember correctly, LinkDB is filtering and normalizing by default. Disable it via noFilter and noNormalize to speed it all up quite a bit. Also, enable map file compression; it greatly reduces IO. And, as Sebastian mentioned, do not run LinkDB on every segment, but once a day or so on all segments fetched that day.
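
In command form, something like this (a sketch: the flags come from the invertlinks usage string, the compression property is the standard Hadoop 2.x output setting for nutch-site.xml or mapred-site.xml, and the paths are placeholders taken from your crawl script):

  # invert several segments in one job, skipping URL filtering and normalization
  bin/nutch invertlinks "$CRAWL_PATH"/linkdb -dir "$CRAWL_PATH"/segments -noNormalize -noFilter

  <!-- compress LinkDb's MapFile output to cut down on IO -->
  <property>
   <name>mapreduce.output.fileoutputformat.compress</name>
   <value>true</value>
  </property>

Note that -dir picks up every segment under the directory; pass individual segment paths instead if you only want the ones fetched that day.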

>
>
> I think I need the linkdb because incoming anchors are important for search-engine relevance, right?

In theory, yes. But a great many other things are probably much more important, such as text extraction and analysis. If you reduce the number of inlinks per record to a few, you probably already have all the linking keywords.
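
If you want to experiment with that, the cap is, if I remember correctly, db.max.inlinks (default 10000 in nutch-default.xml); overriding it in nutch-site.xml would look something like:

  <property>
   <name>db.max.inlinks</name>
   <value>100</value>
  </property>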



Re: Speed of linkDB

Michael Coffey
In reply to this post by Michael Coffey
I am curious about the noFilter and noNormalize options for linkdb, suggested by Markus. What do the default normalize and filtering operations do, and what would I be losing by turning them off?
Still looking to speed up the process. Now using topN 96000 and doing linkdb on multiple segments per job. Surprised to see that the linkdb-merge job seems to be CPU-bound, according to sysstat.

