Problem with latest SVN during reduce phase

Problem with latest SVN during reduce phase

Byron Miller-2
60111 103432 reduce > reduce
060111 103432 Optimizing index.
060111 103433 closing > reduce
060111 103434 closing > reduce
060111 103435 closing > reduce
java.lang.NullPointerException: value cannot be null
        at org.apache.lucene.document.Field.<init>(Field.java:469)
        at org.apache.lucene.document.Field.<init>(Field.java:412)
        at org.apache.lucene.document.Field.UnIndexed(Field.java:195)
        at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:198)
        at org.apache.nutch.mapred.ReduceTask.run(ReduceTask.java:260)
        at org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:90)
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
        at org.apache.nutch.indexer.Indexer.index(Indexer.java:259)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:121)
byron@db02:/data/nutch/trunk$


Pulled today's build and got the above error. No problems with
running out of disk space or anything like that. This is a single
instance, on local file systems.

Any way to recover the crawl / finish the reduce job from where it
failed?

Re: Problem with latest SVN during reduce phase

Andrzej Białecki-2
Byron Miller wrote:

> Pulled today's build and got the above error. No problems with
> running out of disk space or anything like that. This is a single
> instance, on local file systems.

You need a patch that I circulated a couple of days ago, about copying
the segment name and score from content.metadata to parseData.metadata.
I was waiting for someone to test it... but that could just as well be you ;-)
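
The gist is roughly this (just a sketch from memory, not the actual
patch; the key names and the metadata accessor methods here are only
assumptions):

    // Sketch only -- not the actual patch. The key names ("segment",
    // "score") and the Properties-style accessors are assumptions.
    // During parsing, carry the segment name and score that the
    // fetcher stored on the Content over to the ParseData, so the
    // Indexer later finds non-null values for its Lucene fields.
    String segment = content.getMetadata().getProperty("segment");
    String score = content.getMetadata().getProperty("score");
    if (segment != null) parseData.getMetadata().setProperty("segment", segment);
    if (score != null) parseData.getMetadata().setProperty("score", score);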

> Any way to recover the crawl / finish the reduce job from where it
> failed?

I don't think so... although it would be a nice feature.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Problem with latest SVN during reduce phase

Dominik Friedrich
I got this exception a lot, too. I haven't tested the patch by Andrzej
yet; instead I just put the doc.add() lines in the indexer's reduce
function in a try-catch block. This way the indexing finishes even
with a null value, and I can see in the log file which documents
haven't been indexed.
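
The workaround looks roughly like this (just a sketch -- doc, key,
segmentName and the LOG call are stand-ins, not the actual
Indexer.reduce() code):

    // Sketch of the workaround, not the actual Indexer code: wrap the
    // per-document field additions so one bad document is logged and
    // skipped instead of failing the whole job.
    try {
      doc.add(Field.UnIndexed("segment", segmentName));  // may be null
      doc.add(Field.UnStored("content", parse.getText()));
    } catch (RuntimeException e) {
      // the NullPointerException from Field.<init> lands here; log the
      // key (URL) so the skipped document can be found afterwards
      LOG.warning("skipping " + key + ": " + e);
      return;  // give up on this document, keep reducing
    }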

Wouldn't it be a good idea to catch every exception that only affects
a single document in loops like this? At least I don't like it when an
indexing process dies after a few hours because one document triggers
such an exception.

best regards,
Dominik

Byron Miller wrote:

> java.lang.NullPointerException: value cannot be null
> [...]


Re: Problem with latest SVN during reduce phase

Lukáš Vlček
Hi,
I am facing this error as well. I have now located one particular
document which is causing it (an MS Word document which can't be
properly parsed by the parser). I have sent it to Andrzej in a
separate email. Let's see if that helps...
Lukas

On 1/11/06, Dominik Friedrich <[hidden email]> wrote:
> I got this exception a lot, too. I haven't tested the patch by
> Andrzej yet; instead I just put the doc.add() lines in the indexer's
> reduce function in a try-catch block.
> [...]

Re: Problem with latest SVN during reduce phase

Pashabhai
Hi,

   A very similar exception occurs while indexing a
page which does not have body content (and sometimes
no title).

051223 194717 Optimizing index.
java.lang.NullPointerException
        at org.apache.nutch.indexer.basic.BasicIndexingFilter.filter(BasicIndexingFilter.java:75)
        at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:63)
        at org.apache.nutch.crawl.Indexer.reduce(Indexer.java:217)
        at org.apache.nutch.mapred.ReduceTask.run(ReduceTask.java:260)
        at ...

Looking into the source code of BasicIndexingFilter,
it is trying to do:

doc.add(Field.UnStored("content", parse.getText()));

I guess adding a check for null on the parse object,
if (parse != null), should solve the problem.

I can confirm once I have tested it locally.
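
In other words, something along these lines (just a sketch of the
suggested guard, not a tested patch):

    // Only add the content field when the parse and its text are
    // actually present, instead of letting Field.UnStored() throw a
    // NullPointerException on a null value.
    if (parse != null && parse.getText() != null) {
      doc.add(Field.UnStored("content", parse.getText()));
    }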

Thanks
P

--- Lukas Vlcek <[hidden email]> wrote:
> I am facing this error as well. I have now located one particular
> document which is causing it.
> [...]


__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around
http://mail.yahoo.com 

Re: Problem with latest SVN during reduce phase

Lukáš Vlček
Hi,
I think this issue may be more complex. If I remember my test
correctly, the parse object was not null. Also, parse.getText() was
not null (it just contained an empty String).
If a document is not parsed correctly, then an "empty" parse is
returned instead: parseStatus.getEmptyParse(). That should be OK, but
I haven't had a chance to check whether this can cause any trouble
during index optimization.
Lukas
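
PS: so a guard would probably have to cover the empty case as well,
along these lines (again just a sketch, not tested):

    // Treat a null parse, a null text, and an empty text the same
    // way, since getEmptyParse() yields a non-null Parse whose text
    // is just "".
    String text = (parse == null) ? null : parse.getText();
    if (text != null && text.length() > 0) {
      doc.add(Field.UnStored("content", text));
    }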

On 1/12/06, Pashabhai <[hidden email]> wrote:
> A very similar exception occurs while indexing a page which does not
> have body content (and sometimes no title).
> [...]

Re: Problem with latest SVN during reduce phase

Pashabhai
Hi,

   You are right, the Parse object is not null even though
the page has no content and title.

   Could it be the FetcherOutput object???

P

--- Lukas Vlcek <[hidden email]> wrote:
> I think this issue may be more complex. If I remember my test
> correctly, the parse object was not null.
> [...]



Re: Problem with latest SVN during reduce phase

Lukáš Vlček
Hi,

Get the latest SVN version. Andrzej committed some patches yesterday
and now this issue is gone (at least it works fine for me). I believe
revision #368167 is the one we were after.

Regards,
Lukas

On 1/13/06, Pashabhai <[hidden email]> wrote:
> You are right, the Parse object is not null even though the page has
> no content and title.
> [...]

Re: Problem with latest SVN during reduce phase

Byron Miller-2
I'll pull it down today and give it a shot.

thanks,
-byron

--- Lukas Vlcek <[hidden email]> wrote:
> Get the latest SVN version. Andrzej committed some patches yesterday
> and now this issue is gone (at least it works fine for me).
> [...]