Ok, I'm just writing this email because I'm confused about some code in
Lucene, which will almost certainly highlight a failure of mine to
fully grasp something probably very basic. I've had the Lucene in
Action book for a while and have generally been playing around with
Lucene on the side, but now we have a serious need for Lucene to index
a lot of content.
I keep hearing people say Lucene "can index millions of docs in a
handful of minutes", but I can't see this in action, even after
fiddling with mergeFactor/maxMergeDocs etc. (I even tried RAMDirectory
batching; it may be quicker, but it's not that much quicker for me).
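For anyone unfamiliar with what I mean by "RAMDirectory batching": the idea is to buffer documents in memory and only flush them to the disk index in batches. Here's a plain-Java simulation of just the buffer-then-flush pattern (not actual Lucene calls; the `Sink` interface and `indexInBatches` name are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;

public class BatchedIndexing {
    // Hypothetical sink standing in for the disk-based index.
    interface Sink { void addAll(List<String> docs); }

    // Buffer documents in memory and flush them to the sink in
    // batches, mimicking the RAMDirectory-then-merge batching idea.
    static int indexInBatches(List<String> docs, int batchSize, Sink sink) {
        List<String> buffer = new ArrayList<>();
        int flushes = 0;
        for (String doc : docs) {
            buffer.add(doc);
            if (buffer.size() >= batchSize) {
                sink.addAll(new ArrayList<>(buffer)); // flush a full batch
                buffer.clear();
                flushes++;
            }
        }
        if (!buffer.isEmpty()) {          // flush any leftover partial batch
            sink.addAll(new ArrayList<>(buffer));
            flushes++;
        }
        return flushes;
    }
}
```

The point of the pattern is to trade memory for fewer trips to the slow sink; my numbers suggest the bottleneck in my case isn't there, though.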
So, I'm posting my code and analysis, and the hprof trace from it in an
attempt to have as many people look at my code, laugh, and point out my
failings. Please be gentle with me...
First, here's my indexing testbed code (please ignore the RAMwriter; I
know it's not used in this example, but I was trialling things at the
time).
Even changing between compound and non-compound format doesn't really
change the fact that both my Mac PowerBook and a dual 2.4 GHz Xeon,
each with at least a GB of RAM, are effectively CPU maxed out (only
one CPU on the dual Xeon is maxed out, obviously).
The attached hprof logs from runs using the
"-Xrunhprof:cpu=samples,file=log.txt,depth=3" syntax highlight that:
This reads to me as the indexer spending a LOT of time in this method.
Looking at the current code in SVN, and debugging through it, I see
that as each Document is added to the IndexWriter, the loop in
maybeMergeSegments gets called, and it spends a busy time iterating
over each SegmentInfo instance. In my case, it seems that there is
only one document in each SegmentInfo, which means that each time a
new document is added, this loop gets longer and longer. Perhaps
someone can explain why each SegmentInfo contains only one doc at this
stage?
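To illustrate why that loop worries me: if every added document creates a one-doc segment and triggers a walk over the whole segment list, the total work grows quadratically in the number of documents. A toy model in plain Java (this is my simplified reading of the behaviour, not Lucene's actual code):

```java
public class MergeScanCost {
    // Toy model: on every add, a new single-doc segment appears and
    // the writer walks the entire segment list once (what the hprof
    // trace suggests maybeMergeSegments is doing in my run).
    static long totalScanOps(int docs) {
        long ops = 0;
        long segments = 0;
        for (int i = 0; i < docs; i++) {
            segments++;      // each add creates a one-doc segment
            ops += segments; // full walk of the segment list
        }
        return ops;          // = docs * (docs + 1) / 2
    }
}
```

Under this model, doubling the document count roughly quadruples the scan work, which would match the way my per-100-item timings degrade as the index grows.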
Now, I'm sure there is a good reason for this loop, but I'm just
highlighting that it seems to be the biggest CPU consumer of the
indexing operation as I see it (again, it could just be my test
setup). See also log.log, which is the log4j output as each batch of
100 items is added (ignore the "Finished RAM indexing.." line; as
noted above, the RAMWriter isn't used in this case).
Please, observations about why my testing setup might be inaccurate
are very welcome; I just want to understand why I'm CPU bound. At
this rate I'm only averaging 300-400-ish items/second added to the
index. Of course a faster CPU is always a good idea, but if there
_is_ a way to optimize this area, one could see a significant indexing
speed increase (moving the bottleneck back to IO, I'd guess).
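Just to make concrete what I mean by "optimize this area": purely as a speculative sketch (not Lucene's actual code; the class and method names here are invented), the per-add check could presumably be made O(1) by keeping a running tally of buffered single-doc segments instead of rescanning the segment list on every add:

```java
public class IncrementalMergeCheck {
    private final int mergeFactor;
    private int bufferedDocs = 0; // running tally, updated on each add

    IncrementalMergeCheck(int mergeFactor) {
        this.mergeFactor = mergeFactor;
    }

    // O(1) per add: no walk over the segment list; just bump a counter
    // and signal a merge once mergeFactor one-doc segments accumulate.
    boolean addDocument() {
        bufferedDocs++;
        if (bufferedDocs >= mergeFactor) {
            bufferedDocs = 0; // model: they were merged into one segment
            return true;
        }
        return false;
    }
}
```

I have no idea whether the surrounding invariants make such a counter feasible in IndexWriter itself; I'm only trying to show the shape of the saving.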
>>At this rate I'm
>>only getting on average 300-400-ish items/second added to the index.
>I think that's realistic for typical uses of Lucene on common hardware.
Thanks Daniel, it's comforting to know that this is at least expected.
Can you or anyone else comment on the CPU profile I sent in?
If there was a way of optimizing that loop, then it could mean a
reasonable improvement in indexing speed.