Seeing what's occupying all the space in the index

Seeing what's occupying all the space in the index

Rob Staveley (Tom)
I am indexing e-mail in a compound index. The e-mail itself occupies ~60G
(in Bzip2 compressed form), and the index is now 80G.

Is there a tool I can use to see how much of the index is occupied by the
different fields I am indexing?

PS: I am a newbie to the mailing list - I hope I've got the etiquette right


RE: Seeing what's occupying all the space in the index

Rob Staveley (Tom)
In my index of e-mail message parts, it looks like 23K is being used up for
each indexed message part, which is way more than I'd expect.

I have a total of 37 fields per message part.
I tokenize, index and do not store message part bodies.
I store a <= 300 character synopsis of each message part.
All of the other fields are message metadata, which is tokenized, indexed
and stored but these rarely exceed 100 characters - they are for example To,
From, Cc, Subject, Date

I'm still using Lucene 1.4.3, but am in the process of migrating to 1.9.

Is there any way that I can get a picture of what's occupying all the space?


Re: Seeing what's occupying all the space in the index

Grant Ingersoll
Give Luke a try.  Google for "Luke Lucene" and you should find it.  
Otherwise check the Lucene website for a reference.


--

Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
School of Information Studies
335 Hinds Hall
Syracuse, NY 13244

http://www.cnlp.org 
Voice:  315-443-5484
Fax: 315-443-6886


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


RE: Seeing what's occupying all the space in the index

Rob Staveley (Tom)
The server is headless (i.e. no X-Windows). I've tried lucli, but that
doesn't have Luke's whistles and bells. Does Luke have a non-GUI equivalent,
Grant?


Re: Seeing what's occupying all the space in the index

Grant Ingersoll
I don't believe it does.  Is there any way you can mount the drive where
the index lives?  Can you copy the index to someplace that allows you to
run Luke?

Otherwise, you could write a simple standalone program that dumps the
terms and their freqs from the command line.  I don't think it would
take too many lines of code.  I believe somewhere in the contrib package
there is some code named HighFreqTerms.java that will dump out the
highest n occurring terms.


Re: Seeing what's occupying all the space in the index

Karel Tejnora
Or you can use ssh -X for X11 forwarding. I don't know how it works on
Windows (you'd need some X app there), but it works great on Linux(es)
given plenty of bandwidth.



RE: Seeing what's occupying all the space in the index

Rob Staveley (Tom)
Luke is working nicely with an XWin32 demo server I just downloaded from
StarNet, with a bit of SSH tunnelling :-)  [I couldn't immediately figure
out how to do it with Cygwin/X.]

However, I can't see how Luke is going to show me what's occupying most of
my index.


RE: Seeing what's occupying all the space in the index

Rob Staveley (Tom)
> I can't see how Luke is going to show me what's occupying most of my
> index.

I do however notice that none of my stored fields are stored compressed.
Presumably Field.Store.COMPRESS is something that is new in Lucene 1.9 and
wasn't available in 1.4.3? However, it is still hard to see what's causing
the index to grow by 25kB with each document I index.

Is there anything I can learn from the index directory's file listing?
Here's the top of file listing in descending order of size from the index
directory:

--------8<--------
rob@dev:~/dat/indexd/index-1$ ls -lS | more
total 95374800
-rw-r--r--    1 rob dev 4449248724 May 26 00:32 _2fyfe.cfs
-rw-r--r--    1 rob dev 2522952122 May 26 14:17 _2k5vi.fdt
-rw-r--r--    1 rob dev 2413775516 May 26 01:16 _2g6l3.fdt
-rw-r--r--    1 rob dev 2368881846 May 25 18:14 _2ehe7.fdt
-rw-r--r--    1 rob dev 2344670598 May 25 16:31 _2dn69.fdt
-rw-r--r--    1 rob dev 2324315860 May 25 14:10 _2cxea.fdt
-rw-r--r--    1 rob dev 2259070168 May 25 10:28 _2aeb0.fdt
-rw-r--r--    1 rob dev 2113143078 May 24 13:39 _24xxv.fdt
-rw-r--r--    1 rob dev 2005876265 May 23 16:47 _21143.fdt
-rw-r--r--    1 rob dev 1994658169 May 23 16:09 _20mgq.fdt
-rw-r--r--    1 rob dev 1991402285 May 23 14:21 _20id9.fdt
-rw-r--r--    1 rob dev 1973739578 May 23 12:17 _1zvpr.fdt
-rw-r--r--    1 rob dev 1964392156 May 23 11:14 _1zjra.fdt
-rw-r--r--    1 rob dev 1957195484 May 23 10:27 _1zanx.fdt
-rw-r--r--    1 rob dev 1940435968 May 12 13:58 _1374v.cfs
-rw-r--r--    1 rob dev 1932876050 May 23 08:34 _1yep1.fdt
-rw-r--r--    1 rob dev 1908759860 May 22 22:26 _1xhs9.fdt
-rw-r--r--    1 rob dev 1862224271 May 22 14:24 _1vxc8.fdt
-rw-r--r--    1 rob dev 1775652952 May 21 19:15 _1slqr.fdt
-rw-r--r--    1 rob dev 1674336305 May 20 10:10 _1oaje.fdt
-rw-r--r--    1 rob dev 1641906176 May 17 12:24 _1f3lp.cfs
-rw-r--r--    1 rob dev 1626645390 May 19 16:39 _1mjaa.fdt
-rw-r--r--    1 rob dev 1412089155 May 17 12:21 _1f3lp.fdt
-rw-r--r--    1 rob dev 1400399872 May 19 16:51 _1mjaa.cfs
-rw-r--r--    1 rob dev 1090611130 May 12 13:51 _1374v.fdt
-rw-r--r--    1 rob dev 1059463168 May 16 09:24 _1azb3.fdt
-rw-r--r--    1 rob dev 1052419072 May 11 16:36 _1168u.cfs
-rw-r--r--    1 rob dev 1034522679 May 11 16:34 _1168u.fdt
-rw-r--r--    1 rob dev 1023033344 May 23 14:51 _20m4q.fdt
-rw-r--r--    1 rob dev 857224192 May  4 18:16 _lluq.cfs
-rw-r--r--    1 rob dev 821198047 May 26 14:21 _2k5vi.prx
-rw-r--r--    1 rob dev 808694510 May  8 14:09 _sunn.fdt
-rw-r--r--    1 rob dev 786164503 May 26 01:26 _2g6l3.prx
-rw-r--r--    1 rob dev 780030917 May 26 14:21 _2k5vi.frq
-rw-r--r--    1 rob dev 772484883 May 25 18:19 _2ehe7.prx
-rw-r--r--    1 rob dev 763351845 May 25 16:41 _2dn69.prx
-rw-r--r--    1 rob dev 755794097 May 25 14:14 _2cxea.prx
-rw-r--r--    1 rob dev 745972375 May 26 01:26 _2g6l3.frq
-rw-r--r--    1 rob dev 732753582 May 25 10:45 _2aeb0.prx
-rw-r--r--    1 rob dev 732496935 May 25 18:19 _2ehe7.frq
-rw-r--r--    1 rob dev 724428884 May 25 16:41 _2dn69.frq
-rw-r--r--    1 rob dev 719733760 May 25 09:49 _29vyk.fdt
-rw-r--r--    1 rob dev 717613127 May 25 14:14 _2cxea.frq
-rw-r--r--    1 rob dev 696849854 May 25 10:45 _2aeb0.frq
-rw-r--r--    1 rob dev 686227498 May 24 13:59 _24xxv.prx
--------8<--------


Re: Run-Time Error

Andrzej Białecki-2
In reply to this post by Rob Staveley (Tom)
Dennis Kubes wrote:
> The server is headless (i.e. no X-Windows). I've tried lucli, but that
> doesn't have Luke's whistles and bells. Does Luke have a non-GUI equivalent,
> Grant?
>  

You can tunnel your X session through ssh. If that's not possible, AND
you are familiar with Lucene API, then you can use BeanShell - just put
the bsh*.jar in lib/, and then do:

# bin/nutch bsh.Interpreter
BeanShell 2.0b4 - by Pat Niemeyer ([hidden email])
bsh % import org.apache.lucene.index.*;
bsh % import org.apache.lucene.document.*;
bsh % ir = IndexReader.open("indexes/part-00001");
bsh % print(ir.numDocs());
1524567
bsh %

Have fun!

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




RE: Seeing what's occupying all the space in the index

Chris Hostetter-3
In reply to this post by Rob Staveley (Tom)

are you by any chance using different field names for each document -- or
do you have a wide range of field names that aren't the same for each
document? ... you mentioned indexing emails, email has a very loose header
structure that allows MTAs to add arbitrary "X" headers, are you
converting every header to an indexed field?

the reason i ask is that anytime you have an indexed field and you
don't OMIT_NORMS when you add the field, lucene allocates one byte per
document for that field to store the norm value -- even if most of those
documents don't have that field.

if you've got ~25,000 documents in your index, and you add a new document
with an indexed field no other document has, then you'll see at least a
25K increase in index size because of those norms.

(OMIT_NORMS is your friend).
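
As a rough check on that rule, the arithmetic can be sketched in a few lines of Python. This is only an estimate under the one-byte-per-document-per-field assumption described above; the function name `norms_bytes` is made up for illustration, and the second calculation borrows the document and field counts reported elsewhere in this thread.

```python
def norms_bytes(num_docs: int, fields_with_norms: int) -> int:
    """Estimate norm storage: one byte per document per indexed field with norms."""
    return num_docs * fields_with_norms


# The example from the message above: one new field across ~25,000 documents.
print(norms_bytes(25_000, 1))  # 25000 bytes, i.e. roughly 25K

# Figures reported elsewhere in this thread: ~4.3 million documents, 38 fields.
total = norms_bytes(4_300_000, 38)
print(f"{total / 1e9:.2f} GB")  # 0.16 GB
```

With a fixed set of a few dozen fields, the estimate stays in the hundreds of megabytes, so norms only become a major cost when the number of distinct field names grows with the number of documents.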



-Hoss



Re: Seeing what's occupying all the space in the index

Chris Hostetter-3
In reply to this post by Rob Staveley (Tom)

: PS: I am a newbie to the mailing list - I hope I've got the etiquette right

you may have figured this out already, but please don't CC email to
multiple lucene mailing lists -- in this particular case,
lucene-users@jakarta is just a legacy alias that points at java-user@lucene -- so
there's *really* no reason to send to both.




-Hoss



RE: Seeing what's occupying all the space in the index

Rob Staveley (Tom)
In reply to this post by Rob Staveley (Tom)
> Is there anything I can learn from the index directory's file listing?

Running this nasty little BASH one-liner...

$ for i in `ls * | perl -nle 'if (/^.+(\..+)/) {print $1;}' | sort | uniq`; do
    ls -l *$i | awk '{SUM = SUM + $5} END {if (SUM > 1e10) {print "'$i': ", SUM}}'
  done

... I see....

        .cfs:  1.23155e+10
        .fdt:  5.06108e+10
        .frq:  1.27472e+10
        .prx:  1.3444e+10

That means I have 98 GB of files, with:

        51 GB devoted to field data (.fdt),
        13 GB devoted to term positions (.prx)
        13 GB devoted to term frequencies (.frq)
        12 GB devoted to compound files for the field index (.cfs)
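
The same per-extension totals can be gathered without parsing `ls` output. The following is a minimal Python sketch, not part of the original exchange; the names `size_by_extension` and `report` are illustrative, and it assumes the index is a single flat directory of regular files.

```python
import os
from collections import defaultdict


def size_by_extension(index_dir: str) -> dict:
    """Return total bytes per file extension for the regular files in index_dir."""
    totals = defaultdict(int)
    for entry in os.scandir(index_dir):
        if entry.is_file():
            # Files without an extension (e.g. "segments") are grouped under "".
            ext = os.path.splitext(entry.name)[1]
            totals[ext] += entry.stat().st_size
    return dict(totals)


def report(index_dir: str) -> None:
    """Print the totals, largest first, in decimal gigabytes."""
    totals = size_by_extension(index_dir)
    for ext, size in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{ext or '(none)':>8}: {size / 1e9:8.2f} GB")
```

Calling `report()` with the path of the index directory prints every extension rather than only those above 1e10 bytes, so smaller file types show up in the totals as well.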

Does that seem reasonable, bearing in mind I have only indexed 4.3 million
Lucene documents? That's 22.8 kB per Lucene document. Apart from a 300
character synopsis, the fields are all much less than 100 characters long,
and yet this suggests that the index is using about 600 bytes per field.
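
The per-document figures can be reproduced with a short Python calculation. It is only a sketch based on the numbers quoted in this thread: the field count of 38 is an assumption (the thread mentions both 37 and 38), and the `.fdt` total is the one printed by the one-liner above.

```python
total_bytes = 98e9       # ~98 GB of index files, as reported above
fdt_bytes = 5.06108e10   # total size of the .fdt (stored field data) files
num_docs = 4.3e6         # ~4.3 million Lucene documents
num_fields = 38          # assumption: the thread mentions both 37 and 38

per_doc = total_bytes / num_docs
per_field = per_doc / num_fields
fdt_per_doc = fdt_bytes / num_docs

print(f"{per_doc / 1e3:.1f} kB per document")                    # 22.8 kB
print(f"{per_field:.0f} bytes per field")                        # 600
print(f"{fdt_per_doc / 1e3:.1f} kB of .fdt data per document")   # 11.8 kB
```

On these numbers the stored field data alone comes to well over 10 kB per document, which is hard to reconcile with a 300 character synopsis plus short metadata fields.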



RE: Seeing what's occupying all the space in the index

Rob Staveley (Tom)
In reply to this post by Chris Hostetter-3
That's a really good idea, but I've got a total of 38 fields only. It is
true that some of them are empty, but that can't account for the bulk.


Re: Seeing what's occupying all the space in the index

Grant Ingersoll
In reply to this post by Rob Staveley (Tom)
It seems odd to me that, if you are using the CFS format, you would
have the .fdt, .frq and .prx files in addition to the .cfs files.  My
understanding is that all files (except deletable and segments) get put
inside of the CFS file.  Looking at my indices, I only have the CFS file.
Are you optimizing your indices after you are done indexing?  Are you
turning off compound file format?

Can you try a smaller sample in a clean directory and see what size it
is (so that it doesn't take as long to index)?
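
One way to see how mixed the directory is would be to group file names by segment and flag any segment that has both a `.cfs` file and loose per-segment files. The Python sketch below does that for a handful of names copied from the listing earlier in the thread; the function name `segments_with_mixed_files` is made up for illustration, and the result says nothing about which files the index actually references, so it should not be used on its own to decide what to delete.

```python
import os
from collections import defaultdict


def segments_with_mixed_files(filenames):
    """Map segment name -> sorted extensions, for segments that have a .cfs
    file and at least one loose per-segment file alongside it."""
    by_segment = defaultdict(set)
    for name in filenames:
        stem, ext = os.path.splitext(name)
        if ext:  # skip extension-less files such as "segments"
            by_segment[stem].add(ext)
    return {
        seg: sorted(exts)
        for seg, exts in by_segment.items()
        if ".cfs" in exts and len(exts) > 1
    }


# File names taken from the directory listing earlier in the thread.
listing = ["_1374v.cfs", "_1374v.fdt", "_1f3lp.cfs", "_1f3lp.fdt",
           "_2fyfe.cfs", "_2k5vi.fdt", "_2k5vi.prx", "_2k5vi.frq"]
mixed = segments_with_mixed_files(listing)
for seg, exts in sorted(mixed.items()):
    print(seg, exts)
```

In practice the list of names would come from `os.listdir()` on the index directory; segments that appear in the output are candidates for a closer look with Luke or IndexReader.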


RE: Seeing what's occupying all the space in the index

Rob Staveley (Tom)
Interesting. I am explicitly turning on the compound file format when I
start my application, but I am suspicious about my optimizing thread. It
*ought* to be optimising every 30 minutes, using thread synchronisation to
prevent the writer from trying to write while optimisation takes place, but
it is possible that I'm screwing up there (I'll add some diagnostics to
check that optimisation and index writing are mutually exclusive). When I
stopped my daemon and manually optimised, it took 11 minutes to optimise the
index. Is your understanding that .fdt, .frq and .prx files are working
files pre-optimisation and then when optimize() is called they should all
get absorbed into the .cfs? Manual optimisation only clawed back 1G, but I
didn't look to see if .fdt, .frq and .prx files were absorbed into the .cfs
files in the process. I'll investigate that now.

> Can you try a smaller sample in a clean directory and see what size it is
> (so that it doesn't take as long to index)?

I'll try tee-ing off a message feed and index in a new index. I'm working
with a live message feed.


Re: Seeing what's occupying all the space in the index

Doug Cutting
In reply to this post by Rob Staveley (Tom)
Rob Staveley (Tom) wrote:
> Is there a tool I can use to see how much of the index is occupied by the
> different fields I am indexing?

Note that IndexReader has a main() that will list the contents of
compound index files.

Doug


RE: Seeing what's occupying all the space in the index

Rob Staveley (Tom)
In reply to this post by Rob Staveley (Tom)
I just tried to optimise my index, using the lucli command line client, and
got:

--------8<--------
lucli> optimize
Starting to optimize index.
java.io.IOException: Cannot overwrite: /mnt/sdb1/lucene-index/index-1/_2lhqi.fnm
        at org.apache.lucene.store.FSDirectory.createOutput(FSDirectory.java:322)
        at org.apache.lucene.index.FieldInfos.write(FieldInfos.java:255)
        at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:176)
        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:88)
        at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:707)
        at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:684)
        at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:543)
        at lucli.LuceneMethods.optimize(LuceneMethods.java:192)
        at lucli.Lucli.handleCommand(Lucli.java:223)
        at lucli.Lucli.<init>(Lucli.java:148)
        at lucli.Lucli.main(Lucli.java:175)
--------8<--------

It isn't a permissions issue - there are read+write permissions on
_2lhqi.fnm for the user. Does that mean that I have a file called
_2lhqi.fnm, which shouldn't be there - possibly from a previous attempt to
optimise the index, which got interrupted?

Bearing in mind that the index is 4 million documents, I'm reluctant to
re-index everything. If temporary files persist after an interrupted
optimize(), what's the wise/expedient thing to do under these circumstances?


RE: Seeing what's occupying all the space in the index

Rob Staveley (Tom)
In reply to this post by Doug Cutting
> Note that IndexReader has a main() that will list the contents of compound
> index files.

It looks like some of my index is compound and some isn't. My not very well
informed guess is that an optimize() got interrupted somewhere along the
line.

If I try to optimize the index now, it throws exceptions.

lucli> optimize
Starting to optimize index.
java.io.IOException: Cannot overwrite: /mnt/sdb1/index/index-1/_2lhqi.fnm
        at org.apache.lucene.store.FSDirectory.createOutput(FSDirectory.java:322)
        at org.apache.lucene.index.FieldInfos.write(FieldInfos.java:255)
        at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:176)
        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:88)
        at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:707)
        at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:684)
        at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:543)
        at lucli.LuceneMethods.optimize(LuceneMethods.java:192)
        at lucli.Lucli.handleCommand(Lucli.java:223)
        at lucli.Lucli.<init>(Lucli.java:148)
        at lucli.Lucli.main(Lucli.java:175)

Would the smart thing to do at this point be to use IndexReader's main()
method to extract the contents of the compound files and then to attempt to
merge them with the unmerged indexes? [I'll need to delve further into
Doug's excellent LIA to figure out how to do this.]

To recap, I have 98 GB of files in my index, with:

        51 GB devoted to field data (.fdt),
        13 GB devoted to term positions (.prx)
        13 GB devoted to term frequencies (.frq)
        12 GB devoted to compound files for the field index (.cfs)

I seem to have the same mess on a parallel system too - I'm indexing in two
data centres.


RE: Seeing what's occupying all the space in the index

Rob Staveley (Tom)
In reply to this post by Rob Staveley (Tom)
Indexing 55648 documents in a new clean directory, I see only .cfs files (+
deletable  + segments). Disk usage is 65M for all of these, which means that
each message takes ~1K of index space rather than > 10K as it does in my
99GB index.

Bearing in mind that the large index has > 5 million Lucene documents
indexed in it now, do you reckon I can merge the .fdt, .prx and .frq into a
compound index?


Re: Seeing what's occupying all the space in the index

Grant Ingersoll
It kind of sounds like those files are corrupted, but I can't say for
sure.  When you look in Luke at your index (the one with all the files,
not the new one) do you see all the documents you would expect to see
with values that seem reasonable?  Also, in Luke, you can see a listing
of all the files it thinks are in the index, do they match with what you
see via a file listing on the command line?

Also, you may want to see if you have any stale locks or the like that
are preventing you from doing an optimize.
