[jira] Resolved: (LUCENE-196) [PATCH] Added support for segmented field data files and cached directories

Sebastian Nagel (Jira)
     [ http://issues.apache.org/jira/browse/LUCENE-196?page=all ]
     
Otis Gospodnetic resolved LUCENE-196:
-------------------------------------

    Resolution: Duplicate
     Assign To:     (was: Lucene Developers)

Thanks, Christian.  I think LUCENE-545 now provides the solution for selective field loading.
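
For reference, a minimal sketch of that selective loading with the FieldSelector API introduced by LUCENE-545 (Lucene 2.x; the field name "fieldA" is taken from the example quoted below):

    // org.apache.lucene.document.FieldSelector / FieldSelectorResult
    FieldSelector onlyFieldA = new FieldSelector() {
        public FieldSelectorResult accept(String fieldName) {
            return "fieldA".equals(fieldName)
                    ? FieldSelectorResult.LOAD     // read this field from disk
                    : FieldSelectorResult.NO_LOAD; // skip all other fields
        }
    };
    Document doc = indexReader.document(123, onlyFieldA);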

> [PATCH] Added support for segmented field data files and cached directories
> ---------------------------------------------------------------------------
>
>          Key: LUCENE-196
>          URL: http://issues.apache.org/jira/browse/LUCENE-196
>      Project: Lucene - Java
>         Type: Improvement
>   Components: Index
>     Versions: CVS Nightly - Specify date in submission
>  Environment: Operating System: All
> Platform: All
>     Reporter: Christian Kohlschütter
>     Priority: Minor
>  Attachments: docStore-patch.txt, docStore-test-patch.txt, docStore-test-patch.txt, docStore-test-patch.txt, newDocStore-patch.txt, newDocStore-test-patch.txt
>
> Hello,
>  
> I would like to contribute the following enhancement, hoping that it will be
> as useful to you as it is to me.
>  
> For one of my applications, it was necessary to reprocess the Documents
> returned by a search in a Lucene index according to some Field values (for
> applying an "edit distance" function on unindexed fields, in my case).
>  
> Because Lucene has to load every possibly relevant document (*all* fields,
> including the ones that are irrelevant to the algorithm) from disk into
> memory for this operation, it is extremely time-consuming.
>  
> As far as I can see, there is currently no satisfactory solution to improve
> this situation other than buffering all data in RAM using a RAMDirectory.
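>
> A sketch of that workaround (the index path is illustrative):
>
>     Directory ram = new RAMDirectory(FSDirectory.getDirectory("/path/to/index", false));
>     IndexReader reader = IndexReader.open(ram);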
>  
> But what if the field data is just too big to fit in RAM?
>  
> My patch handles this by splitting the monolithic ".fdt" field data file
> into several "data store" files: ".fdt", ".fd1", ".fd2", and so on.
>  
> These "data store" files are connected as a linked-list which permits you to
> load only the part of the field data that is relevant for the current
> operation.
>  
> So, you can load all field data (as in the current implementation), or only
> the fields from a specific interval [0;n] of data stores. Store 0 represents
> the data in the ".fdt" file; all data stores with IDs > 0 are represented by
> the files ".fd1", ".fd2", and so on.
>  
> In my case, I would then simply cache the ".fdt" file (data store 0) in RAM
> (using a symbolic link into a shm-backed tmpfs), but leave all other .fd*
> files on the hard disk. The .fdt file contains only the field relevant to my
> algorithm (and therefore remains quite small); all the other fields are
> stored in the rather big ".fd1" file. So, accessing fields in .fdt requires
> no disk I/O, which speeds things up remarkably.
>  
> You can compare this feature to having multiple tables in a relational
> database linked with 1:1 cardinality, instead of having one big table.
>  
> My proposed enhancement requires some API additions, which I will now
> explain.
>  
> To specify the desired data store for a Field, simply call the new method
> "Field setDataStore(int)" (docstore 0 is the default):
> doc.add(Field.Keyword("fieldA", "this is in docstore 0"));
> doc.add(Field.Keyword("fieldB", "this is in docstore 1").setDataStore(1));
>  
> In this example, fieldA would be stored in ".fdt"; fieldB in ".fd1".
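>
> Put together, a minimal end-to-end indexing sketch (the writer setup is
> standard Lucene; only setDataStore() comes from this patch):
>
>     IndexWriter writer = new IndexWriter(directory, new StandardAnalyzer(), true);
>     Document doc = new Document();
>     doc.add(Field.Keyword("fieldA", "this is in docstore 0"));
>     doc.add(Field.Keyword("fieldB", "this is in docstore 1").setDataStore(1));
>     writer.addDocument(doc);
>     writer.close();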
>  
> When you retrieve the Document object (example docId = 123) using an
> IndexReader, you have the following options:
> "indexReader.document(123)" would load all fields from all data stores.
> "indexReader.document(123, 0)" would load only the fields from data store 0.
> "indexReader.document(123, 1)" would explictly load only the fields from data
> stores 0 and 1.
>  
> The method "IndexReader.document(int n, int k)" is defined to fetch all fields
> from all data stores *at least* up to ID k. That way, existing IndexReader
> subclasses do not have to be modified, as I provide an overridable method in
> IndexReader which simply calls document(int n).
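>
> On the read side, this behaves as follows (document(int n, int k) is the
> method added by this patch; docId 123 as in the examples above):
>
>     IndexReader reader = IndexReader.open(directory);
>     Document partial = reader.document(123, 0); // data store 0 only
>     String a = partial.get("fieldA");           // loaded (stored in .fdt)
>     String b = partial.get("fieldB");           // null: .fd1 was not read
>     Document full = reader.document(123);       // all data stores, as before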
>  
> A more concrete example is attached to this feature request as a JUnit test
> case, along with the patch itself.
>  
> Have fun with it!
>  
>  
> Best regards,
>  
> Christian Kohlschuetter

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly, contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]