RE: "Catalog" backend for document stored fields?

Robichaud, Jean-Philippe-2
[Sorry for the long delay in answering; we are having some issues with our
mail server...]

Thanks for your comment.  Yes it would make sense if the log files were not
so big.  In fact, I'm only indexing a subset of the log information.
Because I store the information in Lucene, it is easier and faster to
retrieve the information.  

For example, generating a report by reading the logs themselves takes ~18
hours.  Converting to xml and indexing the logs takes ~12 hours and each
report takes <20 minutes to generate.  Since we often generate (different)
reports, the Lucene approach is way faster. [Note that I wrote both classes
to convert to xml and to index the logs.  The log2xml step is quite useful
because the xml really has all the log information and compressing the xml
with xmlppm reduces the size of the logs by 97%.  This way I can archive the
log.xml without wasting much space.]

As another comparison point: the logs take 100 GB per week, while the
indices take 35 GB per month!  This design is already much better than the
first approach, but I'm trying to improve it further.
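To put the figures quoted in this message side by side, here is a rough back-of-the-envelope sketch. It assumes an average month of 52/12 weeks and treats the "~18 hours", "~12 hours" and "<20 minutes" values as exact, which they are not; the function names are made up for illustration.

```python
# Rough comparison of the two approaches, using the figures from this message.
# Assumption: an average month is 52/12 weeks long.
WEEKS_PER_MONTH = 52 / 12

RAW_LOG_GB_PER_WEEK = 100   # raw logs
INDEX_GB_PER_MONTH = 35     # Lucene indices (lower bound per the message)

raw_gb_per_month = RAW_LOG_GB_PER_WEEK * WEEKS_PER_MONTH
size_ratio = raw_gb_per_month / INDEX_GB_PER_MONTH


def hours_reading_logs(reports: int) -> float:
    """Each report re-reads the raw logs (~18 hours each)."""
    return 18.0 * reports


def hours_with_index(reports: int) -> float:
    """One-off convert+index pass (~12 hours), then <20 minutes per report."""
    return 12.0 + reports * (20.0 / 60.0)


if __name__ == "__main__":
    print(f"raw logs per month: {raw_gb_per_month:.0f} GB")
    print(f"raw/index size ratio: {size_ratio:.1f}x")
    for n in (1, 3, 10):
        print(n, hours_reading_logs(n), round(hours_with_index(n), 2))
```

Under these assumptions the indexed approach is already ahead after the first report, and the gap widens with every additional one.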

I really do think that this dictionary/catalog approach could benefit other
Lucene users.  I'm not against the idea of doing it myself.  I just need
some pointers and guidelines from all the gurus out there!

Thanks for all your help!

-----Original Message-----
From: Doron Cohen [mailto:[hidden email]]
Sent: Tuesday, October 24, 2006 1:50 AM
To: [hidden email]
Subject: Re: "Catalog" backend for document stored fields?

> I'm indexing logs from a transaction-based application.
> ...
> millions documents per month, the size of the indices is ~35 gigs per month
> (that's the lower bound).  I have no choice but to 'store' each field
> (as well as indexing/tokenizing them) because I'll need to retrieve them in
> order to create various reports.  Also, I have a backlog of ~2 years of logs
> to index!
> ...
> 1-       is there someone out there that already wrote an extension to
> Lucene so that 'stored' string for each document/field is in fact stored in
> a centralized repository? Meaning, only an 'index' is actually stored in the
> document and the real data is put somewhere else.

Do you gain anything from storing the document fields within Lucene?  If
not, especially if the log files are kept somewhere, you could make all
'content' fields unstored (reduce index size), and add a stored non-indexed
ID field. It can also be a POINTER field - e.g. <log file name + start
offset + length>.  At search time, for found documents you can retrieve
this ID/POINTER field and then fetch the document from the (original) log
file. Makes sense?
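The POINTER idea can be sketched independently of Lucene. The following is a minimal illustration in Python, not Lucene code: it assumes one log record per line, and the "name|offset|length" encoding and all function names are hypothetical. The pointer string is what would go into the stored, non-indexed field; at search time it is decoded and the record is read back from the original log file.

```python
import os
import tempfile


def make_pointer(log_name: str, offset: int, length: int) -> str:
    """Pack a record's location into one string (hypothetical encoding)."""
    return f"{log_name}|{offset}|{length}"


def parse_pointer(pointer: str):
    """Decode a pointer; rsplit keeps a '|' inside the file name intact."""
    log_name, offset, length = pointer.rsplit("|", 2)
    return log_name, int(offset), int(length)


def build_pointers(log_dir: str, log_name: str):
    """Scan a log once and return one pointer per line (assumed record unit)."""
    pointers = []
    offset = 0
    with open(os.path.join(log_dir, log_name), "rb") as f:
        for line in f:
            pointers.append(make_pointer(log_name, offset, len(line)))
            offset += len(line)
    return pointers


def fetch_record(log_dir: str, pointer: str) -> bytes:
    """Read exactly the bytes a pointer refers to from the original log."""
    log_name, offset, length = parse_pointer(pointer)
    with open(os.path.join(log_dir, log_name), "rb") as f:
        f.seek(offset)
        return f.read(length)


if __name__ == "__main__":
    log_dir = tempfile.mkdtemp()
    with open(os.path.join(log_dir, "app.log"), "wb") as f:
        f.write(b"tx=1 status=ok\ntx=2 status=fail\n")
    for p in build_pointers(log_dir, "app.log"):
        print(p, fetch_record(log_dir, p))
```

Byte offsets are used rather than character offsets so that seeking works regardless of the log encoding. The trade-off is that the original log files must stay available, unmodified, at the paths the pointers refer to.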

To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]
