Crawl dies unexpectedly


vanderkerkof
Hello everyone

I've just added 12 URLs to my urls/filename seed file and added the same
URLs to my crawl-urlfilter.txt file, then ran the crawl like so:

bin/nutch crawl urls -dir crawl -depth 3
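
For reference, the seed file and the filter entries look roughly like this
(the real hostnames are our internal ones, so treat these as placeholders):

urls/filename:
http://intranet.example.com/
http://docs.example.com/

crawl-urlfilter.txt:
# accept anything under our domains, skip everything else
+^http://([a-z0-9]*\.)*example.com/
-.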

The crawl runs fine at first: it starts grabbing the URLs and creating the
segments, but then it suddenly dies with the following error when trying
to merge the segments.

CrawlDb update: segments: [crawl/segments/20080331113907]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/home/nutch/nutch/trunk/crawl/segments/20080331112151
LinkDb: adding segment: file:/home/nutch/nutch/trunk/crawl/segments/20080331111831
LinkDb: adding segment: file:/home/nutch/nutch/trunk/crawl/segments/20080331111720
LinkDb: adding segment: file:/home/nutch/nutch/trunk/crawl/segments/20080331111741
LinkDb: adding segment: file:/home/nutch/nutch/trunk/crawl/segments/20080331113907
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist : file:/home/nutch/nutch/trunk/crawl/segments/20080331111741/parse_data
        at org.apache.hadoop.mapred.FileInputFormat.validateInput(FileInputFormat.java:154)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:537)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:805)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:147)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:129)

I checked one of the other segments, 20080331111720, and it contained the
following data:

drwxr-xr-x 3 nutch nutch 4096 2008-03-31 11:17 content
drwxr-xr-x 3 nutch nutch 4096 2008-03-31 11:17 crawl_fetch
drwxr-xr-x 2 nutch nutch 4096 2008-03-31 11:17 crawl_generate
drwxr-xr-x 2 nutch nutch 4096 2008-03-31 11:17 crawl_parse
drwxr-xr-x 3 nutch nutch 4096 2008-03-31 11:17 parse_data
drwxr-xr-x 3 nutch nutch 4096 2008-03-31 11:17 parse_text

But the segment with the problem does not contain all that data, only:

drwxr-xr-x 2 nutch nutch 4096 2008-03-31 11:17 crawl_generate

Has anyone got any ideas about what could be going wrong here? I've checked
disk space (loads of gigabytes free), and the permissions on the folders
are identical.

Here are my Nutch SVN details:

nutch@nutch:~/nutch/trunk$ svn info
Path: .
URL: http://svn.apache.org/repos/asf/lucene/nutch/trunk
Repository Root: http://svn.apache.org/repos/asf
Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68
Revision: 641752
Node Kind: directory
Schedule: normal
Last Changed Author: ab
Last Changed Rev: 638782
Last Changed Date: 2008-03-19 10:45:55 +0000 (Wed, 19 Mar 2008)

Any help greatly appreciated.




Re: Crawl dies unexpectedly

Dennis Kubes-2
If you have a crawl depth of 3 then there should be only 3 segments/*
folders.  Any idea where the others came from?

Dennis


Re: Crawl dies unexpectedly

vanderkerkof
Hi Dennis

"If you have a crawl depth of 3 then there should be only 3 segments/*  
folder"

Thanks for that titbit, that makes a bit more sense now.

I have no idea where the other ones are coming from.

One of the sites I'm scanning is quite large, more than 10,000 pages; in
total we're talking about roughly 20,000 pages.

What would you recommend setting the crawl depth to, Dennis?

I've tried rerunning the crawl after deleting the entire folder that it
was jamming on, and it seems to be crawling again.

We'll see what happens this time.

Thanks for getting back to me Dennis.





Re: Crawl dies unexpectedly

Susam Pal
Hi,

You seem to be using the latest revision from trunk. Recrawling was
introduced in the commit for revision #637122, so you can crawl into the
same 'crawl' directory more than once. If you run the first crawl with
-depth M and then run another with -depth N, you'll end up with M + N
segments.
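
For example (the depths here are just placeholders):

bin/nutch crawl urls -dir crawl -depth 3   # first run: 3 segments
bin/nutch crawl urls -dir crawl -depth 2   # second run into the same 'crawl' dir: 2 more, 5 in total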

My guess is that you might have stopped the first crawl before
completion and the segment which remained incomplete caused the error.
If my guess is right, you would probably get the same error again due
to the same segment. If it happens, you might have to delete that
segment to proceed with the index generation.
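
A quick way to spot such a segment is to check each one for its parse_data
directory, something like this (plain shell, run from the directory that
holds 'crawl'; do verify before deleting anything):

for seg in crawl/segments/*; do
  if [ ! -d "$seg/parse_data" ]; then
    echo "incomplete: $seg"
    # rm -r "$seg"   # uncomment once you are sure it is the half-finished one
  fi
done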

Regards,
Susam Pal


Re: Crawl dies unexpectedly

vanderkerkof
Hi Susam

I think that is exactly what happened: I stopped a crawl and it left this
bogus segment folder behind.

I then deleted the folder, ran it again and it all worked fine.

With recrawling, what is the benefit?

Wouldn't it be a better/cleaner solution to delete the crawl folder every
night and crawl from scratch?

That way I'm not going to have to clean up segments at some point in  
the future.

The crawl takes 40 minutes in total so it's not a biggie.
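
Something like this in a nightly cron job is roughly what I had in mind
(the path is just my box, and the depth is a guess for now):

#!/bin/sh
# nightly fresh crawl: throw away yesterday's data and start over
cd /home/nutch/nutch/trunk || exit 1
rm -rf crawl
bin/nutch crawl urls -dir crawl -depth 3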

What do you think Susam?

matt




Re: Crawl dies unexpectedly

Susam Pal
With recrawling, the benefit is that your index can go live sooner: you do
one crawl, put that index live, and then keep recrawling.

Deleting the crawl folder overnight may be feasible for you since your
crawl is very short. For many of us the crawl takes a few days to a week,
so throwing the whole thing away is not always an option. I crawl and
re-crawl for a week and then start over with a new crawl directory every
week. Is there a better way to manage the crawls? If someone has one,
please share.
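
For what it's worth, my rotation boils down to something like this (the
depth and the directory naming are just placeholders):

#!/bin/sh
# run nightly; the directory name changes once a week, so each new week
# starts with a fresh crawl and the nights in between recrawl into it
WEEK=$(date +%Y-%W)
bin/nutch crawl urls -dir crawl-$WEEK -depth 3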

Regards,
Susam Pal


Re: Crawl dies unexpectedly

vanderkerkof
Thanks for getting back to me Susam

I suspected that you guys are using Nutch at a scale much greater than mine.

We're only using it to crawl our internal websites, a tad over 20,000
pages, so, as you say, there is no real benefit to the recrawl in my
scenario. It's handy to know about, though.

I'm going to start playing with the scoring once I've written and tested
the automated crawl script.

I've written an idiot's instruction guide on how to install an Ubuntu box
from scratch, get Nutch installed and running, have it index as many sites
as you want, and style the results pages.

I'll email it back to this group once it's all finished off, and if you
think it's worthwhile to put up on the Wiki then you can; it might be too
simplistic though.

Thanks again Susam

Speak soon

matt


