Question about memory usage and file handling

siddharth teotia
Hi All,

I have a few questions about Lucene indexing and file handling. It would be
great if someone can help with these. I had earlier asked these questions
on [hidden email] but was asked to seek help here.


(1) During indexing, is there any knob to tell the writer to use off-heap
memory for buffering? I didn't find anything in the docs, so the answer is
probably no. Just confirming.

(2) I did some experiments with the buffering threshold using
setRAMBufferSizeMB() on IndexWriterConfig. I tried 16MB (the default),
128MB, 256MB and 512MB. The experiment ingested 5 million documents. It
turns out that the buffering threshold also controls the number of files
that are created in the index directory. In all cases I see only 1 segment
(since there was just one segments_1 file), but there were multiple .cfs
files -- _0.cfs, _1.cfs, _2.cfs, _3.cfs.

How can there be multiple cfs files when there is just one segment? My
understanding from the documentation was that all files for a segment
have the same name but different extensions. In this case, even though
there is only 1 segment, there are still multiple cfs files. Does each
flush result in a new file?

The reason for this experiment is to understand the number of open files
both while building the index and while querying. I am not quite sure why I
am seeing multiple CFS files when there is only 1 segment. I was hoping
there would be only a _0.cfs file. This is true when the buffer threshold is
512MB, but there are 2 cfs files when the threshold is set to 256MB, 5 cfs
files when it is set to 128MB, and I didn't see any CFS file for the default
16MB threshold. There were individual files instead (.fdx, .fdt, .tip etc).
I thought that by default Lucene creates a compound file, at least after
the writer closes. Is that not true?

I can see that during querying, only the cfs file is kept open. But I
would like to understand a little more about the number of cfs files, so
that we can set the buffering threshold to control the heap overhead while
building the index.

(3) In my experiments, the writer commits and is closed after ingesting all
5 million documents, and after that there is no need for us to index
more. So essentially it is an immutable index. However, I want to
understand the threshold for creating a new segment. Is it pretty high?
Or, if the writer is reopened, will the next set of documents go into
the next segment, and so on?

I would really appreciate some help with the above questions.

Thanks,
Siddharth

Re: Question about memory usage and file handling

Shawn Heisey-2
On 11/11/2019 1:40 PM, siddharth teotia wrote:
> I have a few questions about Lucene indexing and file handling. It would be
> great if someone can help with these. I had earlier asked these questions
> on [hidden email] but was asked to seek help here.

This mailing list (solr-user) is for Solr.  Questions about Lucene do
not belong on this list.

You should ask on the java-user mailing list, which is for questions
related to the core (Java) version of Lucene.

http://lucene.apache.org/core/discussion.html#java-user-list-java-userluceneapacheorg

I have put the original sender address in the BCC field just in case you
are not subscribed here.

Thanks,
Shawn

Re: Question about memory usage and file handling

Erick Erickson
(1) No. The internal RAM buffer will pretty much limit the amount of heap used, however.

(2) You actually have several segments. ".cfs" stands for "Compound File", see:

https://lucene.apache.org/core/7_1_0/core/org/apache/lucene/codecs/lucene70/package-summary.html
"An optional 'virtual' file consisting of all the other index files for systems that frequently run out of file handles."

IOW, _0.cfs is a complete segment, _1.cfs is a different, complete segment, etc. The merge policy (TieredMergePolicy) controls when compound files are used vs. the segment being kept in separate files.
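To make the "one prefix per segment" point concrete, here is a minimal sketch that groups a directory listing by segment prefix. It uses only the JDK, not Lucene itself; the class and method names (SegmentFiles, segmentName, groupBySegment) are made up for illustration, and the parsing rule (the prefix runs up to the second underscore, or else up to the first dot) is an assumption about Lucene's usual file naming.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene.
final class SegmentFiles {

    // Returns the segment prefix (e.g. "_0") of an index file name, or null
    // for files that are not tied to one segment (segments_N, write.lock).
    static String segmentName(String fileName) {
        if (!fileName.startsWith("_")) {
            return null;
        }
        int end = fileName.indexOf('_', 1);   // e.g. "_2_Lucene50_0.doc" -> "_2"
        if (end == -1) {
            end = fileName.indexOf('.');      // e.g. "_0.cfs" -> "_0"
        }
        return end == -1 ? fileName : fileName.substring(0, end);
    }

    // Groups file names by segment prefix, sorted by prefix.
    static Map<String, List<String>> groupBySegment(List<String> fileNames) {
        Map<String, List<String>> bySegment = new TreeMap<>();
        for (String name : fileNames) {
            String segment = segmentName(name);
            if (segment != null) {
                bySegment.computeIfAbsent(segment, k -> new ArrayList<>()).add(name);
            }
        }
        return bySegment;
    }

    public static void main(String[] args) {
        List<String> listing = Arrays.asList(
                "segments_1", "write.lock",
                "_0.cfs", "_0.cfe", "_0.si",
                "_1.cfs", "_1.cfe", "_1.si");
        // Each distinct prefix is one segment, so this listing holds two.
        groupBySegment(listing).forEach(
                (segment, files) -> System.out.println(segment + " -> " + files));
    }
}
```

Run against a real index directory listing, the number of keys in the returned map is the number of segments on disk, regardless of how many segments_N files there are.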

New segments are created whenever the RAM buffer is flushed or whenever you do a commit (closing the IW also creates a segment, IIUC). However, under control of the merge policy, segments are merged. See: http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
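As a rough illustration of why a smaller RAM buffer leaves more segments behind, here is a toy model of the flush rule described above. It is not Lucene's actual memory accounting and it ignores merging entirely; FlushModel, countFlushedSegments and the byte sizes are all made up for the sketch.

```java
// Toy model only: every document adds its size to a buffer, the buffer is
// flushed as one segment once it reaches the threshold, and the final
// commit/close flushes whatever is left. Merges are not modelled.
final class FlushModel {

    static int countFlushedSegments(long[] docSizesBytes, long bufferBytes) {
        int segments = 0;
        long buffered = 0;
        for (long size : docSizesBytes) {
            buffered += size;
            if (buffered >= bufferBytes) {
                segments++;      // buffer full: flush a new segment
                buffered = 0;
            }
        }
        if (buffered > 0) {
            segments++;          // commit/close flushes the remainder
        }
        return segments;
    }

    public static void main(String[] args) {
        long[] docs = new long[10];
        java.util.Arrays.fill(docs, 100L);   // ten documents of 100 bytes each
        for (long buffer : new long[] {250L, 500L, 2000L}) {
            System.out.println(buffer + " bytes -> "
                    + countFlushedSegments(docs, buffer) + " segment(s)");
        }
    }
}
```

In this model the segment count only goes down as the buffer grows. The real numbers will differ, because the merge policy can combine flushed segments afterwards.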

You’re confusing closing a writer with merging segments. Essentially, every time a commit happens, the merge policy is called to determine if segments should be merged, see Mike’s blog above.

Additionally, you say "I was hoping there would be only _0.cfs file". This'll pretty much never happen. Segment names always increase; at best you'd have something like _ab.cfs, if not 10-15 _ab* files.
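On names like _ab: as far as I know, a segment name is an underscore followed by a counter written in base 36, which is why letters show up once the counter passes 9. A small JDK-only sketch under that assumption (SegmentNames, nameFor and counterOf are illustrative names, not Lucene API):

```java
// Assumes segment names are "_" + a base-36 counter (digits 0-9, then a-z).
final class SegmentNames {

    // Name of the segment created for a given counter value.
    static String nameFor(long counter) {
        return "_" + Long.toString(counter, Character.MAX_RADIX);
    }

    // Counter value encoded in a segment name such as "_ab".
    static long counterOf(String segmentName) {
        return Long.parseLong(segmentName.substring(1), Character.MAX_RADIX);
    }

    public static void main(String[] args) {
        System.out.println(nameFor(0));        // _0
        System.out.println(nameFor(36));       // _10
        System.out.println(counterOf("_ab"));  // 371
    }
}
```

So an index whose only segment is _ab has already gone through a few hundred flushes and merges; the counter never goes back to 0 for an existing index.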

Lucene likes file handles. Essentially, when searching, a file handle will be open for _every_ file in your index all the time.

All that said, counting the number of files seems like a waste of time. If you're running on a *nix box, the usual advice (for Solr, I'll admit, but I think it applies to Lucene as well) is to set the open-file limit to 65K or so.

And if you're truly concerned, and since you say this is an immutable index, you can do a forceMerge. Prior to Lucene 7.5, that would by default form exactly one segment. For Lucene 7.5 and later, it'll respect the max segment size (a parameter in TMP, defaults to 5 GB) unless you specify a segment count of 1.

Best,
Erick