readdb to dump a specific url

readdb to dump a specific url

Michael Coffey
I want to find out what the crawldb knows about some specific URLs. According to the Nutch wiki, I should use nutch readdb with the -url option. But when I run a command like the following, I get nasty "can't find class" exceptions.


$NUTCH_HOME/runtime/deploy/bin/nutch readdb /crawls/popular/data/crawldb -url http://fabulous.com/

The error message is:

Exception in thread "main" java.io.IOException: can't find class: org.apache.nutch.protocol.ProtocolStatus because org.apache.nutch.protocol.ProtocolStatus
        at org.apache.hadoop.io.AbstractMapWritable.readFields(AbstractMapWritable.java:212)
        at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:167)
        at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:317)
        at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:2256)
        at org.apache.hadoop.io.MapFile$Reader.get(MapFile.java:680)
        at org.apache.hadoop.mapred.MapFileOutputFormat.getEntry(MapFileOutputFormat.java:99)
        at org.apache.nutch.crawl.CrawlDbReader.get(CrawlDbReader.java:465)
        at org.apache.nutch.crawl.CrawlDbReader.readUrl(CrawlDbReader.java:472)
        at org.apache.nutch.crawl.CrawlDbReader.run(CrawlDbReader.java:717)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.crawl.CrawlDbReader.main(CrawlDbReader.java:736)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:136)



The above message occurs for any URL that is actually in the crawldb. If I specify a URL that is not in the crawldb, I get a more understandable message. Also, nutch readdb -stats works reliably.
How can we make this work?

RE: readdb to dump a specific url

Markus Jelsma-2
Hi - this very long-standing problem has been fixed in a more recent Hadoop release than the one you are using now. Upgrade to 2.7.3, or to 2.8.0 once it is released.

Markus
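
To check whether an installation is already at the suggested release before re-running readdb -url, something like the following might help. It is only a minimal sketch: version_ge is a hypothetical helper (not part of Nutch or Hadoop), it assumes GNU sort with the -V option, and it assumes that the first line of `hadoop version` has the form "Hadoop X.Y.Z".

```shell
#!/bin/sh
# Hypothetical helper: succeeds when version $1 is at least version $2.
# Assumes `sort -V` (version sort) from GNU coreutils is available.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Only query Hadoop if the `hadoop` launcher is on the PATH.
if command -v hadoop >/dev/null 2>&1; then
  # Assumption: the first output line looks like "Hadoop 2.7.2".
  installed="$(hadoop version 2>/dev/null | head -n 1 | awk '{print $2}')"
  if version_ge "$installed" "2.7.3"; then
    echo "Hadoop $installed: at or above 2.7.3"
  else
    echo "Hadoop $installed: older than 2.7.3, consider upgrading"
  fi
fi
```

If the check reports an older release, the readdb -url command from the original message would be expected to keep failing until Hadoop is upgraded.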

 
 
-----Original message-----

> From: Michael Coffey <[hidden email]>
> Sent: Saturday 4th March 2017 3:49
> To: User <[hidden email]>
> Subject: readdb to dump a specific url