Assign To: (was: Lucene Developers)
Thanks, Christian. I think LUCENE-545 now provides the solution for selective field loading.
> [PATCH] Added support for segmented field data files and cached directories
> Key: LUCENE-196
> URL: http://issues.apache.org/jira/browse/LUCENE-196
> Project: Lucene - Java
> Type: Improvement
> Components: Index
> Versions: CVS Nightly - Specify date in submission
> Environment: Operating System: All
> Platform: All
> Reporter: Christian Kohlschütter
> Priority: Minor
> Attachments: docStore-patch.txt, docStore-test-patch.txt, docStore-test-patch.txt, docStore-test-patch.txt, newDocStore-patch.txt, newDocStore-test-patch.txt
> I would like to contribute the following enhancement, hoping that it would be
> as useful for you as it is for me.
> For one of my applications, it was necessary to reprocess the Documents
> returned by a search in a Lucene index according to some Field values (for
> applying an "edit distance" function on unindexed fields, in my case).
> Because Lucene has to load every possibly relevant document (*all* fields,
> including the ones which are irrelevant for the algorithm) from disk into
> memory for this operation, doing so is extremely time-consuming.
> As far as I can see, there is currently no satisfactory solution to improve
> this situation except buffering all data in RAM using a RAMDirectory.
> But what if the field data is just too big to fit in RAM?
> My patch will handle this by splitting the monolithic "*.fdt"-Field data file
> into several "data store" files .fdt, .fd1, .fd2 and so on.
> These "data store" files are connected as a linked-list which permits you to
> load only the part of the field data that is relevant for the current
> operation.
> So, you can load all field data (as in the current implementation), or the
> fields from a specific interval [0;n] of data stores. Store 0 represents the
> data in the ".fdt" file, all data stores with ids > 0 are represented by files
> ".fd1", ".fd2", and so on.
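The naming scheme described above can be sketched in a few lines of Java. This is only an illustration of the store-id-to-file mapping as the description states it, not code from the attached patch; the class name, the method names, and the segment name "_1" are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; none of these names come from the patch or from Lucene.
class DataStoreFiles {

    // Store 0 lives in ".fdt"; every store with id n > 0 lives in ".fd" + n.
    static String extensionFor(int storeId) {
        if (storeId < 0) {
            throw new IllegalArgumentException("store id must be >= 0");
        }
        return storeId == 0 ? ".fdt" : ".fd" + storeId;
    }

    // Files that must be read to load the stores in the interval [0;maxStoreId].
    static List<String> filesForInterval(String segment, int maxStoreId) {
        List<String> files = new ArrayList<>();
        for (int id = 0; id <= maxStoreId; id++) {
            files.add(segment + extensionFor(id));
        }
        return files;
    }

    public static void main(String[] args) {
        // "_1" is a made-up segment name.
        System.out.println(filesForInterval("_1", 2)); // [_1.fdt, _1.fd1, _1.fd2]
    }
}
```

Loading the interval [0;0] touches only the ".fdt" file, which is the case the description goes on to exploit.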
> In my case, I would then simply cache the ".fdt" (data store 0) file in RAM
> (using a symbolic link to shm-/tmp), but leave all other .fd* files on
> hard disk. The .fdt file contains only the field relevant to my algorithm
> (and therefore remains quite small); all the other fields are stored in the
> rather big ".fd1" file. So, accessing Fields in .fdt requires no disk I/O,
> which speeds up things remarkably.
> You can compare this feature with having multiple tables in a relational
> database that are linked with 1..1 cardinality instead of having one big
> table.
> My proposed enhancement requires some API additions, which I try to explain
> below.
> To specify the desired data store for a Field, simply call the new method
> "Field setDataStore(int)" (docstore 0 is the default):
> doc.add(Field.Keyword("fieldA", "this is in docstore 0"));
> doc.add(Field.Keyword("fieldB", "this is in docstore 1").setDataStore(1));
> In this example, fieldA would be stored in ".fdt"; fieldB in ".fd1".
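To make the chaining in the example above concrete, here is a toy model in plain Java. It is a sketch under assumptions, not Lucene's Field class and not the patch: only the "Field setDataStore(int)" signature is taken from the description, while the constructor, getDataStore() and groupByStore() are hypothetical helpers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: these are not Lucene's classes.
class DocStoreAssignment {

    static final class Field {
        final String name;
        final String value;
        private int dataStore = 0; // docstore 0 is the default

        Field(String name, String value) {
            this.name = name;
            this.value = value;
        }

        // Returns this, so the call can be chained inside doc.add(...).
        Field setDataStore(int dataStore) {
            if (dataStore < 0) {
                throw new IllegalArgumentException("data store id must be >= 0");
            }
            this.dataStore = dataStore;
            return this;
        }

        // Hypothetical accessor, added for the illustration.
        int getDataStore() {
            return dataStore;
        }
    }

    // Groups field names by the data store they were assigned to.
    static Map<Integer, List<String>> groupByStore(List<Field> fields) {
        Map<Integer, List<String>> byStore = new TreeMap<>();
        for (Field f : fields) {
            byStore.computeIfAbsent(f.getDataStore(), id -> new ArrayList<>()).add(f.name);
        }
        return byStore;
    }

    public static void main(String[] args) {
        List<Field> doc = Arrays.asList(
            new Field("fieldA", "this is in docstore 0"),
            new Field("fieldB", "this is in docstore 1").setDataStore(1));
        System.out.println(groupByStore(doc)); // {0=[fieldA], 1=[fieldB]}
    }
}
```

Returning the Field from the setter is what lets the store assignment sit inline in the add call without a temporary variable.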
> When you retrieve the Document object (example docId = 123) using an
> IndexReader, you have the following options:
> "indexReader.document(123)" would load all fields from all data stores.
> "indexReader.document(123, 0)" would load only the fields from data store 0.
> "indexReader.document(123, 1)" would explicitly load only the fields from data
> stores 0 and 1.
> The method "IndexReader.document(int n, int k)" is defined to fetch all fields
> from all data stores *at least* up to ID k. That way, existing IndexReader
> subclasses do not have to be modified, as I provide an overridable method in
> IndexReader which simply calls document(int n).
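The "at least up to ID k" contract is easier to see with a small model. The sketch below is an assumption-laden illustration, not the patch and not Lucene's IndexReader: BaseReader, StoreAwareReader and the in-memory layout are invented for the example. The base class answers document(n, k) by loading everything, which satisfies "at least"; a store-aware subclass overrides it to stop at store k.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: not Lucene's IndexReader.
class ReaderContractSketch {

    static class BaseReader {
        // docs.get(n).get(s) holds the fields of document n kept in data store s.
        protected final List<List<Map<String, String>>> docs;

        BaseReader(List<List<Map<String, String>>> docs) {
            this.docs = docs;
        }

        // Loads all fields from all data stores.
        Map<String, String> document(int n) {
            List<Map<String, String>> stores = docs.get(n);
            return merge(stores, stores.size() - 1);
        }

        // Default: "at least up to k" is satisfied by loading everything.
        Map<String, String> document(int n, int k) {
            return document(n);
        }

        // Merges the fields of stores 0..k (clamped to the stores that exist).
        static Map<String, String> merge(List<Map<String, String>> stores, int k) {
            Map<String, String> fields = new LinkedHashMap<>();
            int last = Math.min(k, stores.size() - 1);
            for (int s = 0; s <= last; s++) {
                fields.putAll(stores.get(s));
            }
            return fields;
        }
    }

    static class StoreAwareReader extends BaseReader {
        StoreAwareReader(List<List<Map<String, String>>> docs) {
            super(docs);
        }

        // Override: read only stores 0..k.
        @Override
        Map<String, String> document(int n, int k) {
            return merge(docs.get(n), k);
        }
    }

    // One document with fieldA in store 0 and fieldB in store 1.
    static List<List<Map<String, String>>> sampleDocs() {
        List<Map<String, String>> doc0 = Arrays.asList(
            Collections.singletonMap("fieldA", "this is in docstore 0"),
            Collections.singletonMap("fieldB", "this is in docstore 1"));
        return Collections.singletonList(doc0);
    }

    public static void main(String[] args) {
        List<List<Map<String, String>>> docs = sampleDocs();
        System.out.println(new BaseReader(docs).document(0, 0).keySet());       // [fieldA, fieldB]
        System.out.println(new StoreAwareReader(docs).document(0, 0).keySet()); // [fieldA]
    }
}
```

Callers therefore have to tolerate extra fields in the result: a reader that was never adapted returns more than was asked for, but never less.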
> A more concrete example is attached to this feature request as a
> JUnit-Testcase, as well as the patch itself.
> Have fun with it!
> Best regards,
> Christian Kohlschuetter