readlinkdb fails to dump linkdb


readlinkdb fails to dump linkdb

brainstorm-2-2
Using nutch 0.9 (hadoop 0.17.1):

[hadoop@cluster working]$ bin/nutch readlinkdb
/home/hadoop/crawl-20081201/crawldb -dump crawled_urls.txt
LinkDb dump: starting
LinkDb db: /home/hadoop/crawl-urls-20081201/crawldb
java.io.IOException: Type mismatch in value from map: expected
org.apache.nutch.crawl.Inlinks, recieved
org.apache.nutch.crawl.CrawlDatum
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:427)
        at org.apache.hadoop.mapred.lib.IdentityMapper.map(IdentityMapper.java:37)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)

LinkDbReader: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
        at org.apache.nutch.crawl.LinkDbReader.processDumpJob(LinkDbReader.java:110)
        at org.apache.nutch.crawl.LinkDbReader.run(LinkDbReader.java:127)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.LinkDbReader.main(LinkDbReader.java:114)

This is the first time I've used readlinkdb; the rest of the crawling
process works fine. I've searched JIRA and there's no related bug.

I've also tried the latest Nutch trunk, but DFS is not working for me:

[hadoop@cluster trunk]$ bin/hadoop dfs -ls

Exception in thread "main" java.lang.RuntimeException:
java.lang.ClassNotFoundException:
org.apache.hadoop.hdfs.DistributedFileSystem
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:648)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1334)
        at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:56)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1351)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:213)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:118)
        at org.apache.hadoop.fs.FsShell.init(FsShell.java:88)
        at org.apache.hadoop.fs.FsShell.run(FsShell.java:1698)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.fs.FsShell.main(FsShell.java:1847)
Caused by: java.lang.ClassNotFoundException:
org.apache.hadoop.hdfs.DistributedFileSystem
        at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:247)
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:628)
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:646)
        ... 10 more

Should I file both bugs in JIRA?

Re: readlinkdb fails to dump linkdb

Doğacan Güney-3
On Wed, Dec 3, 2008 at 8:55 PM, brainstorm <[hidden email]> wrote:
> Using nutch 0.9 (hadoop 0.17.1):
>
> [hadoop@cluster working]$ bin/nutch readlinkdb
> /home/hadoop/crawl-20081201/crawldb -dump crawled_urls.txt
> LinkDb dump: starting
> LinkDb db: /home/hadoop/crawl-urls-20081201/crawldb
                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It seems you are providing a crawldb as the argument. You should pass the linkdb instead.
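
For example (just guessing the layout here, assuming the linkdb sits next
to the crawldb under the same crawl directory), the dump would look
something like:

bin/nutch readlinkdb /home/hadoop/crawl-20081201/linkdb -dump crawled_urls.txt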

> java.io.IOException: Type mismatch in value from map: expected
> org.apache.nutch.crawl.Inlinks, recieved
> org.apache.nutch.crawl.CrawlDatum
>        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:427)
>        at org.apache.hadoop.mapred.lib.IdentityMapper.map(IdentityMapper.java:37)
>        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
>        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)
>
> LinkDbReader: java.io.IOException: Job failed!
>        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
>        at org.apache.nutch.crawl.LinkDbReader.processDumpJob(LinkDbReader.java:110)
>        at org.apache.nutch.crawl.LinkDbReader.run(LinkDbReader.java:127)
>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>        at org.apache.nutch.crawl.LinkDbReader.main(LinkDbReader.java:114)
>
> This is the first time I've used readlinkdb; the rest of the crawling
> process works fine. I've searched JIRA and there's no related bug.
>
> I've also tried the latest Nutch trunk, but DFS is not working for me:
>
> [hadoop@cluster trunk]$ bin/hadoop dfs -ls
>
> Exception in thread "main" java.lang.RuntimeException:
> java.lang.ClassNotFoundException:
> org.apache.hadoop.hdfs.DistributedFileSystem
>        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:648)
>        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1334)
>        at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:56)
>        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1351)
>        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:213)
>        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:118)
>        at org.apache.hadoop.fs.FsShell.init(FsShell.java:88)
>        at org.apache.hadoop.fs.FsShell.run(FsShell.java:1698)
>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>        at org.apache.hadoop.fs.FsShell.main(FsShell.java:1847)
> Caused by: java.lang.ClassNotFoundException:
> org.apache.hadoop.hdfs.DistributedFileSystem
>        at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
>        at java.security.AccessController.doPrivileged(Native Method)
>        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
>        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
>        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
>        at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
>        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
>        at java.lang.Class.forName0(Native Method)
>        at java.lang.Class.forName(Class.java:247)
>        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:628)
>        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:646)
>        ... 10 more
>
> Should I file both bugs in JIRA?
>

I'm not sure about this one, but did you try ant clean; ant? It may be a
version mismatch.
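
Something along these lines, run from your Nutch checkout (paths are just
an example):

cd trunk
ant clean
ant
ls lib/hadoop-*.jar    # sanity check: the bundled Hadoop jar should match what the cluster runs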


--
Doğacan Güney

Re: readlinkdb fails to dump linkdb

brainstorm-2-2
On Wed, Dec 3, 2008 at 8:29 PM, Doğacan Güney <[hidden email]> wrote:

> On Wed, Dec 3, 2008 at 8:55 PM, brainstorm <[hidden email]> wrote:
>> Using nutch 0.9 (hadoop 0.17.1):
>>
>> [hadoop@cluster working]$ bin/nutch readlinkdb
>> /home/hadoop/crawl-20081201/crawldb -dump crawled_urls.txt
>> LinkDb dump: starting
>> LinkDb db: /home/hadoop/crawl-urls-20081201/crawldb
>                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> It seems you are providing a crawldb as the argument. You should pass the linkdb instead.


Thanks a lot for the hint, but I cannot find a "linkdb" directory anywhere
on HDFS :_/ Can you point me to where it should be?


>> java.io.IOException: Type mismatch in value from map: expected
>> org.apache.nutch.crawl.Inlinks, recieved
>> org.apache.nutch.crawl.CrawlDatum
>>        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:427)
>>        at org.apache.hadoop.mapred.lib.IdentityMapper.map(IdentityMapper.java:37)
>>        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
>>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
>>        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)
>>
>> LinkDbReader: java.io.IOException: Job failed!
>>        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
>>        at org.apache.nutch.crawl.LinkDbReader.processDumpJob(LinkDbReader.java:110)
>>        at org.apache.nutch.crawl.LinkDbReader.run(LinkDbReader.java:127)
>>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>        at org.apache.nutch.crawl.LinkDbReader.main(LinkDbReader.java:114)
>>
>> This is the first time I've used readlinkdb; the rest of the crawling
>> process works fine. I've searched JIRA and there's no related bug.
>>
>> I've also tried the latest Nutch trunk, but DFS is not working for me:
>>
>> [hadoop@cluster trunk]$ bin/hadoop dfs -ls
>>
>> Exception in thread "main" java.lang.RuntimeException:
>> java.lang.ClassNotFoundException:
>> org.apache.hadoop.hdfs.DistributedFileSystem
>>        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:648)
>>        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1334)
>>        at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:56)
>>        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1351)
>>        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:213)
>>        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:118)
>>        at org.apache.hadoop.fs.FsShell.init(FsShell.java:88)
>>        at org.apache.hadoop.fs.FsShell.run(FsShell.java:1698)
>>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>        at org.apache.hadoop.fs.FsShell.main(FsShell.java:1847)
>> Caused by: java.lang.ClassNotFoundException:
>> org.apache.hadoop.hdfs.DistributedFileSystem
>>        at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
>>        at java.security.AccessController.doPrivileged(Native Method)
>>        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
>>        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
>>        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
>>        at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
>>        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
>>        at java.lang.Class.forName0(Native Method)
>>        at java.lang.Class.forName(Class.java:247)
>>        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:628)
>>        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:646)
>>        ... 10 more
>>
>> Should I file both bugs in JIRA?
>>
>
> I'm not sure about this one, but did you try ant clean; ant? It may be a
> version mismatch.


Yes, I ran ant clean && ant before trying the above command. I also
tried to upgrade the filesystem, unsuccessfully, and even recreated it
from scratch:

https://issues.apache.org/jira/browse/HADOOP-1212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650556#action_12650556


>
> --
> Doğacan Güney
>

Re: readlinkdb fails to dump linkdb

Doğacan Güney-3
On Thu, Dec 4, 2008 at 11:33 AM, brainstorm <[hidden email]> wrote:

> On Wed, Dec 3, 2008 at 8:29 PM, Doğacan Güney <[hidden email]> wrote:
>> On Wed, Dec 3, 2008 at 8:55 PM, brainstorm <[hidden email]> wrote:
>>> Using nutch 0.9 (hadoop 0.17.1):
>>>
>>> [hadoop@cluster working]$ bin/nutch readlinkdb
>>> /home/hadoop/crawl-20081201/crawldb -dump crawled_urls.txt
>>> LinkDb dump: starting
>>> LinkDb db: /home/hadoop/crawl-urls-20081201/crawldb
>>                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>
>> It seems you are providing a crawldb as the argument. You should pass the linkdb instead.
>
>
> Thanks a lot for the hint, but I cannot find a "linkdb" directory anywhere
> on HDFS :_/ Can you point me to where it should be?

A linkdb is created with the invertlinks command, e.g.:

bin/nutch invertlinks crawl/linkdb crawl/segments/....
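
So on your crawl, assuming the same layout as in your first mail (the
paths are a guess), the whole sequence would be roughly:

bin/nutch invertlinks /home/hadoop/crawl-20081201/linkdb -dir /home/hadoop/crawl-20081201/segments
bin/nutch readlinkdb /home/hadoop/crawl-20081201/linkdb -dump crawled_urls.txt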

>
>
>>> java.io.IOException: Type mismatch in value from map: expected
>>> org.apache.nutch.crawl.Inlinks, recieved
>>> org.apache.nutch.crawl.CrawlDatum
>>>        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:427)
>>>        at org.apache.hadoop.mapred.lib.IdentityMapper.map(IdentityMapper.java:37)
>>>        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
>>>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
>>>        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)
>>>
>>> LinkDbReader: java.io.IOException: Job failed!
>>>        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
>>>        at org.apache.nutch.crawl.LinkDbReader.processDumpJob(LinkDbReader.java:110)
>>>        at org.apache.nutch.crawl.LinkDbReader.run(LinkDbReader.java:127)
>>>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>        at org.apache.nutch.crawl.LinkDbReader.main(LinkDbReader.java:114)
>>>
>>> This is the first time I've used readlinkdb; the rest of the crawling
>>> process works fine. I've searched JIRA and there's no related bug.
>>>
>>> I've also tried the latest Nutch trunk, but DFS is not working for me:
>>>
>>> [hadoop@cluster trunk]$ bin/hadoop dfs -ls
>>>
>>> Exception in thread "main" java.lang.RuntimeException:
>>> java.lang.ClassNotFoundException:
>>> org.apache.hadoop.hdfs.DistributedFileSystem
>>>        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:648)
>>>        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1334)
>>>        at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:56)
>>>        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1351)
>>>        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:213)
>>>        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:118)
>>>        at org.apache.hadoop.fs.FsShell.init(FsShell.java:88)
>>>        at org.apache.hadoop.fs.FsShell.run(FsShell.java:1698)
>>>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>>        at org.apache.hadoop.fs.FsShell.main(FsShell.java:1847)
>>> Caused by: java.lang.ClassNotFoundException:
>>> org.apache.hadoop.hdfs.DistributedFileSystem
>>>        at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
>>>        at java.security.AccessController.doPrivileged(Native Method)
>>>        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
>>>        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
>>>        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
>>>        at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
>>>        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
>>>        at java.lang.Class.forName0(Native Method)
>>>        at java.lang.Class.forName(Class.java:247)
>>>        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:628)
>>>        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:646)
>>>        ... 10 more
>>>
>>> Should I file both bugs in JIRA?
>>>
>>
>> I'm not sure about this one, but did you try ant clean; ant? It may be a
>> version mismatch.
>
>
> Yes, I ran ant clean && ant before trying the above command. I also
> tried to upgrade the filesystem, unsuccessfully, and even recreated it
> from scratch:
>
> https://issues.apache.org/jira/browse/HADOOP-1212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650556#action_12650556
>
>
>>
>> --
>> Doğacan Güney
>>
>



--
Doğacan Güney