Reduce segment size

Reduce segment size

Ledio Ago
Hi there!
 
After a crawl/index cycle a segment directory is created, which usually
contains content, index, and other directories.
Here is what my current segment directory actually contains after a
crawl/index build of 2 million URLs:
 
/segments/20070114151631> du -sh *
9.6G    content
212M    fetcher
5.0G    index
0       index.done
5.8G    parse_data
3.7G    parse_text
 
The segment directory is copied to a searcher.  As you can see, the
content directory is huge.
 
My question is: if you just remove this directory, would that affect
search capability, or later recrawling and reindexing?
The content directory is so big; is there a way to avoid copying it
to the searcher?
 
Thanks,
Ledio

 
 

Ledio Ago * Sr. Software Engineer * [hidden email]

w: 415-348-7693 * f: 415-348-7032

 

LookSmart - Where To Look For What You Need. - Find. Save. Share.

625 Second Street, San Francisco, CA 94107

 

Re: Reduce segment size

Sean Dean-3
It won't affect re-crawling, as that's dependent on the Nutch DB, but it will prevent you from re-indexing the deleted data, since indexing needs those files.
 
I have never tried running Nutch with "just" the index files; it might work or it might not, but it's something to test (move the other directories out of the segment, but don't delete them).
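
One way to test that safely, assuming the directory layout from the listing above (the backup path is only an example):

cd /segments/20070114151631
mkdir ../20070114151631.held
mv content fetcher ../20070114151631.held/
# restart the searcher and run a few test queries;
# if anything breaks, move the directories back:
#   mv ../20070114151631.held/* .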


----- Original Message ----
From: Ledio Ago <[hidden email]>
To: [hidden email]
Sent: Thursday, January 18, 2007 8:57:15 PM
Subject: Reduce segment size


Hi there!

After a crawl/index cycle a segment directory is created which usually
contains content, index, and so on directories.
Here is what actually my current segment directory has after crawl/index
build of 2 Million URLs:

/segments/20070114151631> du -sh *
9.6G    content
212M    fetcher
5.0G    index
0       index.done
5.8G    parse_data
3.7G    parse_text

The segment directory is copied to a searcher.  As you can see the
content directory is huge.

My question is, if you just remove this directory, would that affect the
search capability, or later the recrawling and reindexing?
The content directory is so big, is there is a way not to have to copy
that directory to the searcher?

Thanks,
Ledio




Ledio Ago * Sr. Software Engineer * [hidden email]

w: 415-348-7693 * f: 415-348-7032



LookSmart - Where To Look For What You Need. - Find. Save. Share.

625 Second Street, San Francisco, CA 94107

java.lang.OutOfMemoryError - trunk

Gal Nitzan

Thanks Sean,

I get out of memory errors.

I have set the max heap for both Nutch and Hadoop to 2000 MB each, but it
doesn't seem to affect anything. The out-of-memory error happens
immediately after a task starts.

Any idea?

java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2786)
        at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at org.apache.hadoop.io.Text.writeString(Text.java:399)
        at org.apache.nutch.parse.Outlink.write(Outlink.java:52)
        at org.apache.nutch.parse.ParseData.write(ParseData.java:163)
        at org.apache.nutch.parse.ParseImpl.write(ParseImpl.java:55)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:323)
        at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:96)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:183)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1367)





Reduce segment size

Ledio Ago
In reply to this post by Ledio Ago
Hi there!
 
After a crawl/index cycle a segment directory is created, which usually
contains content, index, and other directories.
Here is what my current segment directory actually contains after a
crawl/index build of 2 million URLs:
 
/segments/20070114151631> du -sh *
9.6G    content
212M    fetcher
5.0G    index
0       index.done
5.8G    parse_data
3.7G    parse_text
 
The segment directory is copied to a searcher.  As you can see, the
content directory is huge.
 
My question is: if you just remove this directory, would that affect
search capability, or later recrawling and reindexing?
The content directory is so big; is there a way to avoid copying it
to the searcher?
 
Thanks,
Ledio


RE: Reduce segment size

Ledio Ago
In reply to this post by Sean Dean-3
Thanks, I'll give it a try.

-Ledio

P.S. Sorry for resending my previous email.

-----Original Message-----
From: Sean Dean [mailto:[hidden email]]
Sent: Thursday, January 18, 2007 11:05 PM
To: [hidden email]
Subject: Re: Reduce segment size

It wont affect re-crawling as that's dependant on the Nutch DB, but it
will prevent you from re-indexing the data that was deleted as it needs
those files.
 
I have never tried running Nutch "just" with the index file, it might
work or it might not but its something to test (move them out of the
directory, but don't delete them).


----- Original Message ----
From: Ledio Ago <[hidden email]>
To: [hidden email]
Sent: Thursday, January 18, 2007 8:57:15 PM
Subject: Reduce segment size


Hi there!

After a crawl/index cycle a segment directory is created which usually
contains content, index, and so on directories.
Here is what actually my current segment directory has after crawl/index
build of 2 Million URLs:

/segments/20070114151631> du -sh *
9.6G    content
212M    fetcher
5.0G    index
0       index.done
5.8G    parse_data
3.7G    parse_text

The segment directory is copied to a searcher.  As you can see the
content directory is huge.

My question is, if you just remove this directory, would that affect the
search capability, or later the recrawling and reindexing?
The content directory is so big, is there is a way not to have to copy
that directory to the searcher?

Thanks,
Ledio




Ledio Ago * Sr. Software Engineer * [hidden email]

w: 415-348-7693 * f: 415-348-7032



LookSmart - Where To Look For What You Need. - Find. Save. Share.

625 Second Street, San Francisco, CA 94107

Re: java.lang.OutOfMemoryError - trunk

Sean Dean-3
In reply to this post by Gal Nitzan
What OS are you using with Nutch, and what version of the JVM?
 
If it's Linux, paste the output of "ulimit -a"; if it's BSD, use "limits".
 
You can also try inserting "-Xms2000m" before you set the max heap, so it would look like "-Xms2000m -Xmx2000m".

I'm also assuming you have at least 2 GB of free RAM, or even more?
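
One thing worth checking, depending on where that heap size is being set: the heap given to the Nutch/Hadoop start scripts does not necessarily reach the map/reduce child tasks, which run in their own JVMs. A rough sketch, using the variable and property names from the 0.10.x-era scripts and hadoop-default.xml (double-check them against your own copies; HADOOP_HEAPSIZE is normally set in conf/hadoop-env.sh, shown here as exports for brevity):

# heap for the commands/daemons started by bin/hadoop and bin/nutch
export HADOOP_HEAPSIZE=2000
export NUTCH_HEAPSIZE=2000

# heap for map/reduce child tasks, set in conf/hadoop-site.xml instead:
#   <property>
#     <name>mapred.child.java.opts</name>
#     <value>-Xms2000m -Xmx2000m</value>
#   </property>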
 
----- Original Message ----
From: Gal Nitzan <[hidden email]>
To: [hidden email]
Sent: Friday, January 19, 2007 10:57:01 AM
Subject: java.lang.OutOfMemoryError - trunk


Thanks Sean,

I get out of memory errors.

I have set max heap for both nutch and hadoop 2000mb each but it doesn't
seem to affect anything. The out of memory happenes immediately after start
of a task.

Any idea?

java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2786)
    at
java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
    at java.io.DataOutputStream.write(DataOutputStream.java:90)
    at org.apache.hadoop.io.Text.writeString(Text.java:399)
    at org.apache.nutch.parse.Outlink.write(Outlink.java:52)
    at org.apache.nutch.parse.ParseData.write(ParseData.java:163)
    at org.apache.nutch.parse.ParseImpl.write(ParseImpl.java:55)
    at
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:323)
    at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:96)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:183)
    at
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1367)

RE: Reduce segment size

Ledio Ago
In reply to this post by Ledio Ago
Quick question:

It won't affect re-crawling, as that's dependent on the Nutch DB, but it
will prevent you from re-indexing the deleted data, since indexing needs
those files.
> Why would I want to reindex entries that I've deleted?
 
I have never tried running Nutch "just" with the index file, it might
work or it might not but its something to test (move them out of the
directory, but don't delete them).


----- Original Message ----
From: Ledio Ago <[hidden email]>
To: [hidden email]
Sent: Thursday, January 18, 2007 8:57:15 PM
Subject: Reduce segment size


Hi there!

After a crawl/index cycle a segment directory is created which usually
contains content, index, and so on directories.
Here is what actually my current segment directory has after crawl/index
build of 2 Million URLs:

/segments/20070114151631> du -sh *
9.6G    content
212M    fetcher
5.0G    index
0       index.done
5.8G    parse_data
3.7G    parse_text

The segment directory is copied to a searcher.  As you can see the
content directory is huge.

My question is, if you just remove this directory, would that affect the
search capability, or later the recrawling and reindexing?
The content directory is so big, is there is a way not to have to copy
that directory to the searcher?

Thanks,
Ledio




Ledio Ago * Sr. Software Engineer * [hidden email]

w: 415-348-7693 * f: 415-348-7032



LookSmart - Where To Look For What You Need. - Find. Save. Share.

625 Second Street, San Francisco, CA 94107

RE: java.lang.OutOfMemoryError - trunk

Gal Nitzan
In reply to this post by Sean Dean-3

Hi Sean,

Thanks for the prompt reply.

I'm using FC6, Java 1.6.0, 8 GB RAM.

I'll try your suggestion.

Gal


-----Original Message-----
From: Sean Dean [mailto:[hidden email]]
Sent: Friday, January 19, 2007 8:25 PM
To: [hidden email]
Subject: Re: java.lang.OutOfMemoryError - trunk

What OS are you using with Nutch, and what version of JVM?
 
If its Linux, paste the output of "ulimit -a", if its BSD use "limits".
 
You can also try inserting "-Xms2000m" before you set the max heap, so it
would look like "-Xms2000m -Xmx2000m".

I'm also assuming you have at least 2g free of RAM, or even more?
 
----- Original Message ----
From: Gal Nitzan <[hidden email]>
To: [hidden email]
Sent: Friday, January 19, 2007 10:57:01 AM
Subject: java.lang.OutOfMemoryError - trunk


Thanks Sean,

I get out of memory errors.

I have set max heap for both nutch and hadoop 2000mb each but it doesn't
seem to affect anything. The out of memory happenes immediately after start
of a task.

Any idea?

java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2786)
    at
java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
    at java.io.DataOutputStream.write(DataOutputStream.java:90)
    at org.apache.hadoop.io.Text.writeString(Text.java:399)
    at org.apache.nutch.parse.Outlink.write(Outlink.java:52)
    at org.apache.nutch.parse.ParseData.write(ParseData.java:163)
    at org.apache.nutch.parse.ParseImpl.write(ParseImpl.java:55)
    at
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:323)
    at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:96)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:183)
    at
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1367)



RE: java.lang.OutOfMemoryError - trunk

Gal Nitzan
In reply to this post by Sean Dean-3
P.S.

ulimit -a

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
max nice                        (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 138239
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
max rt priority                 (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 138239
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited



-----Original Message-----
From: Sean Dean [mailto:[hidden email]]
Sent: Friday, January 19, 2007 8:25 PM
To: [hidden email]
Subject: Re: java.lang.OutOfMemoryError - trunk

What OS are you using with Nutch, and what version of JVM?
 
If its Linux, paste the output of "ulimit -a", if its BSD use "limits".
 
You can also try inserting "-Xms2000m" before you set the max heap, so it
would look like "-Xms2000m -Xmx2000m".

I'm also assuming you have at least 2g free of RAM, or even more?
 
----- Original Message ----
From: Gal Nitzan <[hidden email]>
To: [hidden email]
Sent: Friday, January 19, 2007 10:57:01 AM
Subject: java.lang.OutOfMemoryError - trunk


Thanks Sean,

I get out of memory errors.

I have set max heap for both nutch and hadoop 2000mb each but it doesn't
seem to affect anything. The out of memory happenes immediately after start
of a task.

Any idea?

java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2786)
    at
java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
    at java.io.DataOutputStream.write(DataOutputStream.java:90)
    at org.apache.hadoop.io.Text.writeString(Text.java:399)
    at org.apache.nutch.parse.Outlink.write(Outlink.java:52)
    at org.apache.nutch.parse.ParseData.write(ParseData.java:163)
    at org.apache.nutch.parse.ParseImpl.write(ParseImpl.java:55)
    at
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:323)
    at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:96)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:183)
    at
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1367)



Re: Reduce segment size

Sean Dean-3
In reply to this post by Ledio Ago
That would depend on your situation and what exactly you're trying to accomplish with Nutch.
 
In my case the goal is to produce the largest possible index, yet still keep it updated. I'm basically fetching segments of 1-2 million pages and then merging them together, and in this case I will always require the segment data (other than crawl_generate, which can be safely deleted after the fetch is done).
 
If you only need, for example, 3 million documents in an index and you don't really care which ones they are, then you could generate a brand-new 3-million-URL segment every time, fetch it, run the database functions, and index it, without caring about re-indexing the previous segment, since you did everything in one operation.
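
For reference, one round of that cycle looks roughly like this (command names and argument order are from the Nutch 0.8-era bin/nutch script, and the crawl/ paths are only illustrative; run bin/nutch with no arguments to confirm the exact usage on your version):

bin/nutch generate crawl/crawldb crawl/segments -topN 2000000
SEGMENT=`ls -d crawl/segments/2* | tail -1`
bin/nutch fetch $SEGMENT
bin/nutch updatedb crawl/crawldb $SEGMENT
# crawl_generate inside $SEGMENT can be removed once the fetch is done
bin/nutch mergesegs crawl/segments_merged -dir crawl/segments
bin/nutch invertlinks crawl/linkdb -dir crawl/segments_merged
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments_merged/*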


----- Original Message ----
From: Ledio Ago <[hidden email]>
To: [hidden email]
Sent: Friday, January 19, 2007 1:36:57 PM
Subject: RE: Reduce segment size


Quick question:

It wont affect re-crawling as that's dependant on the Nutch DB, but it
will prevent you from re-indexing the data that was deleted as it needs
those files.
> Why would I want to reindex entries that I've deleted?

I have never tried running Nutch "just" with the index file, it might
work or it might not but its something to test (move them out of the
directory, but don't delete them).


----- Original Message ----
From: Ledio Ago <[hidden email]>
To: [hidden email]
Sent: Thursday, January 18, 2007 8:57:15 PM
Subject: Reduce segment size


Hi there!

After a crawl/index cycle a segment directory is created which usually
contains content, index, and so on directories.
Here is what actually my current segment directory has after crawl/index
build of 2 Million URLs:

/segments/20070114151631> du -sh *
9.6G    content
212M    fetcher
5.0G    index
0       index.done
5.8G    parse_data
3.7G    parse_text

The segment directory is copied to a searcher.  As you can see the
content directory is huge.

My question is, if you just remove this directory, would that affect the
search capability, or later the recrawling and reindexing?
The content directory is so big, is there is a way not to have to copy
that directory to the searcher?

Thanks,
Ledio




Ledio Ago * Sr. Software Engineer * [hidden email]

w: 415-348-7693 * f: 415-348-7032



LookSmart - Where To Look For What You Need. - Find. Save. Share.

625 Second Street, San Francisco, CA 94107

RE: Reduce segment size

Ledio Ago
That would depend on your situation and what exactly you're trying
to accomplish with Nutch.

> Let me clear this up a little:

I have a crawl/index machine which crawls and indexes a fixed list of
URLs, with no new discovery.  The generated index gets copied to searchers.
I've set up a RAM disk on those boxes and the whole index is loaded into
memory (the index is partitioned).  To save memory, and to allow the
machines to hold more of the index in RAM, I'd like to reduce the size of
the segments by removing unnecessary data that is not used during the
search process.  As you can see, the original index still remains on the
index box, and it will be used in the next crawl/index cycle.
 
... and in this case I will always require the segment data (other than
crawl_generate, which can be safely deleted after the fetch is done).

> What do you mean by the "crawl_generate" data? Are you talking about the
"content" directory?
 
Thanks Sean for your input,
Ledio


----- Original Message ----
From: Ledio Ago <[hidden email]>
To: [hidden email]
Sent: Friday, January 19, 2007 1:36:57 PM
Subject: RE: Reduce segment size


Quick question:

It wont affect re-crawling as that's dependant on the Nutch DB, but it
will prevent you from re-indexing the data that was deleted as it needs
those files.
> Why would I want to reindex entries that I've deleted?

I have never tried running Nutch "just" with the index file, it might
work or it might not but its something to test (move them out of the
directory, but don't delete them).


----- Original Message ----
From: Ledio Ago <[hidden email]>
To: [hidden email]
Sent: Thursday, January 18, 2007 8:57:15 PM
Subject: Reduce segment size


Hi there!

After a crawl/index cycle a segment directory is created which usually
contains content, index, and so on directories.
Here is what actually my current segment directory has after crawl/index
build of 2 Million URLs:

/segments/20070114151631> du -sh *
9.6G    content
212M    fetcher
5.0G    index
0       index.done
5.8G    parse_data
3.7G    parse_text

The segment directory is copied to a searcher.  As you can see the
content directory is huge.

My question is, if you just remove this directory, would that affect the
search capability, or later the recrawling and reindexing?
The content directory is so big, is there is a way not to have to copy
that directory to the searcher?

Thanks,
Ledio




Ledio Ago * Sr. Software Engineer * [hidden email]

w: 415-348-7693 * f: 415-348-7032



LookSmart - Where To Look For What You Need. - Find. Save. Share.

625 Second Street, San Francisco, CA 94107

Re: Reduce segment size

Sean Dean-3
In reply to this post by Ledio Ago
One possibility, unless you're doing this already, is to load only the "index" directory/files onto the RAM disk and keep the rest of the segment data on regular disks. This should produce very little performance loss, if any at all. Beyond that, you won't be able to save any further memory (RAM) in your preferred setup.
 
The crawl_generate directory only holds the fetch list, but it seems you're using Nutch 0.7.x? In that case, the "fetcher" directory/files can be deleted afterwards.
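
On Linux, a tmpfs mount is one way to do that. A minimal sketch, with purely illustrative sizes and paths (the searcher opens the index through the filesystem, so a symlink should be transparent to it):

mkdir -p /mnt/ramdisk
mount -t tmpfs -o size=6g tmpfs /mnt/ramdisk
cp -r /search/segments/20070114151631/index /mnt/ramdisk/20070114151631_index
# keep the on-disk copy and point the segment's index at the in-RAM one
mv /search/segments/20070114151631/index /search/segments/20070114151631/index.ondisk
ln -s /mnt/ramdisk/20070114151631_index /search/segments/20070114151631/index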


----- Original Message ----
From: Ledio Ago <[hidden email]>
To: [hidden email]
Sent: Friday, January 19, 2007 2:34:59 PM
Subject: RE: Reduce segment size


That would be dependant on your situation and what exactly your trying
to accomplish with Nutch.

> Let me clear this up a little:

I have a crawl/index machine which crawls and index a fixed list of
URLs, no new discovery.  The generated index gets copied to searchers.
I've setup RamDisk on those boxes and the all index is loaded in memory
the index is partitioned).  To save memory, and to allow the machines to
have more of the index in Ram, I'd like to reduce the size of the
segments by removing unnecessary data, that are not used during the
searching process.  As you can see the original index still remains in
the index box and it will be used in the next crawl/index cycle.

... and in this case I will always require the segment data (other then
crawl_generate, which can be safely deleted after the fetch is done).

> What do you mean the "crawl_generate" data.  Are you talking about the
"content" directory?

Thanks Sean for your input,
Ledio


----- Original Message ----
From: Ledio Ago <[hidden email]>
To: [hidden email]
Sent: Friday, January 19, 2007 1:36:57 PM
Subject: RE: Reduce segment size


Quick question:

It wont affect re-crawling as that's dependant on the Nutch DB, but it
will prevent you from re-indexing the data that was deleted as it needs
those files.
> Why would I want to reindex entries that I've deleted?

I have never tried running Nutch "just" with the index file, it might
work or it might not but its something to test (move them out of the
directory, but don't delete them).


----- Original Message ----
From: Ledio Ago <[hidden email]>
To: [hidden email]
Sent: Thursday, January 18, 2007 8:57:15 PM
Subject: Reduce segment size


Hi there!

After a crawl/index cycle a segment directory is created which usually
contains content, index, and so on directories.
Here is what actually my current segment directory has after crawl/index
build of 2 Million URLs:

/segments/20070114151631> du -sh *
9.6G    content
212M    fetcher
5.0G    index
0       index.done
5.8G    parse_data
3.7G    parse_text

The segment directory is copied to a searcher.  As you can see the
content directory is huge.

My question is, if you just remove this directory, would that affect the
search capability, or later the recrawling and reindexing?
The content directory is so big, is there is a way not to have to copy
that directory to the searcher?

Thanks,
Ledio




Ledio Ago * Sr. Software Engineer * [hidden email]

w: 415-348-7693 * f: 415-348-7032



LookSmart - Where To Look For What You Need. - Find. Save. Share.

625 Second Street, San Francisco, CA 94107

Re: Reduce segment size

Andrzej Białecki-2
In reply to this post by Ledio Ago
Ledio Ago wrote:

> Hi there!
>  
> After a crawl/index cycle a segment directory is created which usually
> contains content, index, and so on directories.
> Here is what actually my current segment directory has after crawl/index
> build of 2 Million URLs:
>  
> /segments/20070114151631> du -sh *
> 9.6G    content
> 212M    fetcher
> 5.0G    index
> 0       index.done
> 5.8G    parse_data
> 3.7G    parse_text
>  

The searcher needs content only to display a cached preview of the page.
It doesn't need it for anything else. It doesn't need "fetcher" either.
So, the only parts that you have to copy are the index, parse_data, and
parse_text.

(BTW, if you deploy only the index, you will be able to search and see
the title and URL of each document, because those are stored in the
index, but you won't be able to get "snippets", i.e. fragments of
matching text, because those come from parse_text.)
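
So, with the layout from the original listing, the copy to the searcher can simply skip those two directories. An rsync sketch (the host name and paths are only placeholders):

rsync -a \
  --exclude=content \
  --exclude=fetcher \
  /segments/20070114151631/ searcher:/search/segments/20070114151631/

With the sizes reported above, that leaves roughly 9.8 GB (content plus fetcher) out of the ~24 GB segment behind.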

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Are Nutch segments from Hadoop 0.7.1 different from Hadoop 0.10.1?

Gal Nitzan
In reply to this post by Gal Nitzan
Hi,

Just updated to trunk, and I get the following exception while running
mergesegs on segments that were crawled/parsed on Hadoop 0.7.1:

java.lang.RuntimeException: readObject can't find class
        at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:179)
        at org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:61)
        at org.apache.nutch.metadata.MetaWrapper.readFields(MetaWrapper.java:100)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.spill(MapTask.java:427)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpillToDisk(MapTask.java:385)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$200(MapTask.java:239)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:188)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1367)
Caused by: java.lang.ClassNotFoundException: (a lot of garbage here)....
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:247)
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:315)
        at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:177)
        ... 7 more




RE: Reduce segment size

Ledio Ago
In reply to this post by Andrzej Białecki-2
Makes sense.  Thanks for the responses guys.

-Ledio

-----Original Message-----
From: Andrzej Bialecki [mailto:[hidden email]]
Sent: Friday, January 19, 2007 12:22 PM
To: [hidden email]
Subject: Re: Reduce segment size

Ledio Ago wrote:
> Hi there!
>  
> After a crawl/index cycle a segment directory is created which usually

> contains content, index, and so on directories.
> Here is what actually my current segment directory has after
> crawl/index build of 2 Million URLs:
>  
> /segments/20070114151631> du -sh *
> 9.6G    content
> 212M    fetcher
> 5.0G    index
> 0       index.done
> 5.8G    parse_data
> 3.7G    parse_text
>  

Searcher needs content only to display a cached preview of the page. It
doesn't need it for anything else. It doesn't need "fetcher" either. So,
the only parts that you have to copy is the index, parse_data and
parse_text.

(BTW. if you deploy only the index, you will be able to search and see
the title of the document and url, because they are stored in the index,
but you won't be able to get "snippets", i.e. fragments of matching
text, because this comes from parse_text).

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: java.lang.OutOfMemoryError - trunk

Espen Amble Kolstad-2
In reply to this post by Gal Nitzan
Hi,
I think the new sorting directly after the map job in hadoop-0.10.x
causes this. I had the same problem.
You could check io.sort.factor and io.sort.mb in conf/hadoop-site.xml.
Maybe lower at least io.sort.mb?

Maybe that helps?
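
For reference, overriding those in conf/hadoop-site.xml looks roughly like this (the property names come from hadoop-default.xml; the values are only examples to experiment with, not recommendations):

<property>
  <name>io.sort.mb</name>
  <value>50</value>
</property>
<property>
  <name>io.sort.factor</name>
  <value>10</value>
</property>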

- Espen

Gal Nitzan wrote:

> Hi Sean,
>
> Thanks for the prompt reply.
>
> I'm using fc6 java 1.6.0, 8GB RAM.
>
> I'll try your suggestion.
>
> Gal
>
>
> -----Original Message-----
> From: Sean Dean [mailto:[hidden email]]
> Sent: Friday, January 19, 2007 8:25 PM
> To: [hidden email]
> Subject: Re: java.lang.OutOfMemoryError - trunk
>
> What OS are you using with Nutch, and what version of JVM?
>  
> If its Linux, paste the output of "ulimit -a", if its BSD use "limits".
>  
> You can also try inserting "-Xms2000m" before you set the max heap, so it
> would look like "-Xms2000m -Xmx2000m".
>
> I'm also assuming you have at least 2g free of RAM, or even more?
>  
> ----- Original Message ----
> From: Gal Nitzan <[hidden email]>
> To: [hidden email]
> Sent: Friday, January 19, 2007 10:57:01 AM
> Subject: java.lang.OutOfMemoryError - trunk
>
>
> Thanks Sean,
>
> I get out of memory errors.
>
> I have set max heap for both nutch and hadoop 2000mb each but it doesn't
> seem to affect anything. The out of memory happenes immediately after start
> of a task.
>
> Any idea?
>
> java.lang.OutOfMemoryError: Java heap space
>     at java.util.Arrays.copyOf(Arrays.java:2786)
>     at
> java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
>     at java.io.DataOutputStream.write(DataOutputStream.java:90)
>     at org.apache.hadoop.io.Text.writeString(Text.java:399)
>     at org.apache.nutch.parse.Outlink.write(Outlink.java:52)
>     at org.apache.nutch.parse.ParseData.write(ParseData.java:163)
>     at org.apache.nutch.parse.ParseImpl.write(ParseImpl.java:55)
>     at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:323)
>     at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:96)
>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:183)
>     at
> org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1367)
>
>
>