readdb to dump a specific url


readdb to dump a specific url

Michael Coffey
I want to find out what the crawldb knows about some specific URLs. According to the Nutch wiki, I should use nutch readdb with the -url option. But when I run a command like the following, I get a nasty "can't find class" exception.


$NUTCH_HOME/runtime/deploy/bin/nutch readdb /crawls/popular/data/crawldb -url http://fabulous.com/

The error message is:

Exception in thread "main" java.io.IOException: can't find class: org.apache.nutch.protocol.ProtocolStatus because org.apache.nutch.protocol.ProtocolStatus
        at org.apache.hadoop.io.AbstractMapWritable.readFields(AbstractMapWritable.java:212)
        at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:167)
        at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:317)
        at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:2256)
        at org.apache.hadoop.io.MapFile$Reader.get(MapFile.java:680)
        at org.apache.hadoop.mapred.MapFileOutputFormat.getEntry(MapFileOutputFormat.java:99)
        at org.apache.nutch.crawl.CrawlDbReader.get(CrawlDbReader.java:465)
        at org.apache.nutch.crawl.CrawlDbReader.readUrl(CrawlDbReader.java:472)
        at org.apache.nutch.crawl.CrawlDbReader.run(CrawlDbReader.java:717)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.crawl.CrawlDbReader.main(CrawlDbReader.java:736)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:136)



The above message occurs for any URL that is actually in the crawldb. If I specify a URL that does not exist, I get a more understandable message. Also, nutch readdb -stats works reliably.
How can we make this work?
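
One workaround that may sidestep the classloader problem until Hadoop is upgraded is to run readdb from the local runtime instead of the deploy runtime, since the local script puts the Nutch classes directly on the main classpath. A rough sketch, assuming the crawldb is first copied out of HDFS to a local path (/tmp/crawldb here is just an illustrative location):

# copy the crawldb out of HDFS so the local runtime can read it
hadoop fs -copyToLocal /crawls/popular/data/crawldb /tmp/crawldb
# run readdb from the local runtime, which has the Nutch classes on the classpath
$NUTCH_HOME/runtime/local/bin/nutch readdb /tmp/crawldb -url http://fabulous.com/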

RE: readdb to dump a specific url

Markus Jelsma-2
Hi - this very long-standing problem has been fixed in Hadoop releases more recent than the one you are using now. Upgrade to 2.7.3, or to 2.8.0 once it is released.
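
To check which Hadoop version the cluster is actually running (any release older than 2.7.3 would be affected), the standard Hadoop CLI reports it; the first line of the output names the release:

hadoop version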

Markus

 
 

Speed of linkDB Merge

Michael Coffey
In my situation, I find that the linkdb merge takes much more time than fetch and parse combined, even though fetching is fully polite (and therefore slow).

What is the standard advice for making linkdb-merge go faster?

I call invertlinks like this:
__bin_nutch invertlinks "$CRAWL_PATH"/linkdb "$CRAWL_PATH"/segments/$SEGMENT

invertlinks seems to call mergelinkdb automatically.

I currently have about 3-6 slaves for fetching, though that will increase soon. I am currently using small segment sizes (3000 URLs), but I can increase that if it would help.
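
For reference, segment size is controlled by the -topN argument to the generate step, so increasing it needs no configuration change. A sketch, assuming the same __bin_nutch wrapper and $CRAWL_PATH layout as above (50000 is just an illustrative value):

__bin_nutch generate "$CRAWL_PATH"/crawldb "$CRAWL_PATH"/segments -topN 50000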

I have the following properties that may be relevant.

<property>
  <name>db.max.outlinks.per.page</name>
  <value>1000</value>
</property>

<property>
  <name>db.ignore.external.links</name>
  <value>false</value>
</property>


The following props are left as default in nutch-default.xml

<property>
  <name>db.update.max.inlinks</name>
  <value>10000</value>
</property>

<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
</property>

<property>
  <name>db.ignore.external.links</name>
  <value>false</value>
</property>

Re: Speed of linkDB Merge

Sebastian Nagel
Hi Michael,

what is the size of your linkdb? If it's large (significantly larger than the segment),
the reason is easily explained: the linkdb needs to be rewritten on every invertlinks step.
That's an expensive operation, and it becomes more expensive as the crawl grows. Unless you really
need the linkdb to add anchor texts to your index, you could:
 - either limit the linkdb size by excluding internal links,
 - or update it less frequently (multiple segments in one pass),
as sketched below.
A segment size of 3000 URLs seems small for a distributed crawl with a large number of different
hosts or domains. You may observe similar problems when updating the CrawlDb, though later, because
the CrawlDb is usually smaller than the linkdb, especially if the linkdb also includes internal links.
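
For example, internal links can be excluded by flipping the db.ignore.internal.links property quoted above to true in nutch-site.xml:

<property>
  <name>db.ignore.internal.links</name>
  <value>true</value>
</property>

And several segments can be folded into the linkdb in one pass with the -dir form of invertlinks, which processes every segment under the given directory (a sketch, reusing the $CRAWL_PATH layout from the original message):

__bin_nutch invertlinks "$CRAWL_PATH"/linkdb -dir "$CRAWL_PATH"/segments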

Best,
Sebastian
