Stored Field vs "offset plus external file"?

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Stored Field vs "offset plus external file"?

eks dev
I would like to try to replace our external storage of documents with Lucene stored field, so a few questions before we proceed:

Background: We store currently complete documents  in a simple binary file and only keep offsets into this file as a Stored field   in Lucene index. Documents (compressed) are short, avg 300-400bytes per document

What we would like to try is to store these documents as binary (compressed with simple/fast static huffman + dictionary) stored Fields in order to make maintenance of this setup easier as Lucene already does maintain updates and fetching of these fields, also indexing works very well from multiple threads now. Complete Document would be the only Stored field we have in index (I know about lazy loading).
Simply what we do today is , search index, get offsets from found documents and than fetch them from binary file.

So the questions would be:

1. Does anyone see any theoretical/practical reason why our homemade "fetch offset from lucene stored field-> jump to this offset in document" would be faster than "fetch stored field from Lucene directly"

2.  We use Lucene Index wit MMAP directory now, so the concern is that index could grow too large for MMAP with stored field like that.  Is there a way to say, "do not use MMAP Directory for stored Fields, rather FSDirectory". I think not, but it is worth to ask as I think this could be useful thing... Imagine to be able to say "Load terms with RAM Directory, postings wit MMAP and stored fields with FSDirectory"... of course this is only for searching, not indexing.

Thanks a lot.
Eks
 



      ___________________________________________________________
Yahoo! Answers - Got a question? Someone out there knows the answer. Try it
now.
http://uk.answers.yahoo.com/


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Stored Field vs "offset plus external file"?

Andrzej Białecki-2
eks dev wrote:

> 2.  We use Lucene Index wit MMAP directory now, so the concern is
> that index could grow too large for MMAP with stored field like that.
> Is there a way to say, "do not use MMAP Directory for stored Fields,
> rather FSDirectory". I think not, but it is worth to ask as I think
> this could be useful thing... Imagine to be able to say "Load terms
> with RAM Directory, postings wit MMAP and stored fields with
> FSDirectory"... of course this is only for searching, not indexing.


IMHO it should be possible to do this by implementing a Directory
subclass that can be configured to use mmap / ram / fs for specific
index file types (e.g. tis, tii, fdt, prx and so on) - you should be
able to cut & paste large chunks of each directory code to start the
implementation.


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]