"Catalog" backend for document stored fields?


Robichaud, Jean-Philippe-2
Hello to all of you!

I'm using Lucene to index millions of relatively small documents.  In fact,
I'm indexing logs from a transaction-based application.  Each document
represents what happened during one 'transaction'.  Each of them is
composed of 5-6 main 'states', which are themselves composed of a couple of
'events'.  The document structure is something like this:

 

State1.event1.some_key=value
State1.event1.another_key=another_value
[...]
State1.event4.another_key=yet another_value

State2.event1.a_third_key=bla bla bla
State3.event1. ...

 

All in all, each document has between 10 and 250 fields.  I can't fit this
in a db because the nature of these 'transactions' is quite dynamic and I
can't think of a [simple/maintainable] database schema.  That's why Lucene
is so wonderful for this particular project. I have a super generic set of
classes that enables me to generate any kind of report I want.  Really, it's
wonderful.

 

Now as you can imagine, indexing 'logs' means indexing really repetitive
information.  Some of the document fields contain values like 'OK' or
'failed'; others have more 'unique' values, but all in all there is a huge
redundancy across these documents.  Since I'm indexing about 20
million documents per month, the size of the indices is ~35 gigs per month
(that's the lower bound).  I have no choice but to 'store' each field value
(as well as indexing/tokenizing it) because I'll need to retrieve them in
order to create various reports.  Also, I have a backlog of ~2 years of logs
to index!

 

All this to ask:

1- Is there someone out there that already wrote an extension to
Lucene so that the 'stored' string for each document/field is in fact stored
in a centralized repository? Meaning, only an 'index' is actually stored in
the document and the real data is put somewhere else.

2- If not, how hard would it be to write such an extension?  Which
classes would need to be modified?  FSDirectory? Document?

3- Any ideas on how else I could do this?  I'm fully open to
discussion!

 

Thanks for your help!

 

Jp

 

_____________________________________________

JEAN-PHILIPPE ROBICHAUD
Speech Scientist Professional Services

NUANCE COMMUNICATIONS, INC.
1500 University, suite 935
Montreal, Quebec  H3A 3S7

514 904 7800  Office
514 843 6872  Fax
<http://www.nuance.com/> NUANCE.COM

The experience speaks for itself (tm)

 


Re: "Catalog" backend for document stored fields?

eks dev

> 1- Is there someone out there that already wrote an extension to
> Lucene so that 'stored' string for each document/field is in fact stored in
> a centralized repository? Meaning, only an 'index' is actually stored in the
> document and the real data is put somewhere else.
>
> 2- If not, how hard would it be to write such extension?  Which
> classes would need to be modified?  FSDirectory? Document?
>
> 3- Any ideas on how else I could do this?  I'm fully open to
> discussion!

If I've understood what you need, it's easy: you need some sort of simple dictionary compression. Write your own Analyzer that is constructed with a HashMap<YOUR_STRING, Integer> and replaces tokens with Integers (you could store them as VInts later to save some more space).
Fill this HashMap with the unique terms from your field, and if there are too many of them, encode only the most frequent ones.

The transformations in this case are:
-> start with Document.Field == "array of Strings"
-> put it through your analyzer
-> you get Document.Field == "array of Integers" (presumably more space efficient for your case?)
-> store the ints as VInts to save a few more bits

Later you will need an array or HashMap to map the ints back to tokens and reconstruct your docs.

Of course, you will need to map your queries the same way.
Is that what you wanted?
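
A minimal sketch of the dictionary idea described above, in plain Java with no Lucene dependency. The class ValueDictionary and its method names are hypothetical, chosen only for illustration; wiring it into an Analyzer and into query rewriting is left out. The variable-length integer encoding shown is the common "7 data bits per byte, high bit means more bytes follow" scheme, which is assumed here to be what "VInts" refers to.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not a Lucene class): maps repetitive values to small ints.
class ValueDictionary {
    private final Map<String, Integer> idsByValue = new HashMap<>();
    private final List<String> valuesById = new ArrayList<>();

    // Returns the id already assigned to a value, or assigns the next free one.
    int encode(String value) {
        Integer id = idsByValue.get(value);
        if (id == null) {
            id = valuesById.size();
            idsByValue.put(value, id);
            valuesById.add(value);
        }
        return id;
    }

    // Reverse mapping, needed to rebuild the original values for a report.
    String decode(int id) {
        return valuesById.get(id);
    }

    // Variable-length encoding: low 7 bits first, high bit set while more bytes follow.
    static byte[] toVInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    // Decodes exactly one value produced by toVInt.
    static int fromVInt(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            shift += 7;
        }
        return value;
    }
}
```

With values such as 'OK' and 'failed' repeated across millions of documents, each occurrence would shrink to a single byte as long as its id stays below 128. The dictionary itself has to be persisted alongside the index, otherwise the ids cannot be mapped back.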






---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: "Catalog" backend for document stored fields?

Mike Klaas
In reply to this post by Robichaud, Jean-Philippe-2
On 10/20/06, Robichaud, Jean-Philippe
<[hidden email]> wrote:
> 3-       Any ideas on how else I could do this?  I'm fully open to
> discussion!

How about not storing the fields at all, but storing term vectors, and
reconstructing the data from termpositions + terminfo?

-Mike
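
A rough sketch of the reconstruction step suggested above, in plain Java. The map passed to rebuild() is a stand-in for whatever a term vector with positions would yield for one field of one document (each term plus the token positions at which it occurs); the class and method names are hypothetical and no Lucene API is used. Note that this can only recover the analyzed token stream, so anything the analyzer changed or dropped (case, punctuation, stop words) would not come back.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: rebuilds a field's token sequence from term -> positions data.
class TermVectorRebuilder {
    static String rebuild(Map<String, int[]> positionsByTerm) {
        // Invert the mapping so that tokens can be read back in position order.
        TreeMap<Integer, String> termByPosition = new TreeMap<>();
        for (Map.Entry<String, int[]> entry : positionsByTerm.entrySet()) {
            for (int position : entry.getValue()) {
                termByPosition.put(position, entry.getKey());
            }
        }
        // TreeMap iterates in ascending key order, i.e. original token order.
        return String.join(" ", termByPosition.values());
    }
}
```

For a field whose value was 'bla bla bla', the term vector would hold the single term 'bla' at positions 0, 1 and 2, and rebuild() would return the same three tokens separated by spaces.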


RE: "Catalog" backend for document stored fields?

Robichaud, Jean-Philippe-2
In reply to this post by Robichaud, Jean-Philippe-2
That may be a good idea.  Is it possible to do this efficiently, e.g. inside
the collect() call of a HitCollector?  Right now, that's how my reporting
tool works:

Searcher searcher = new MultiSearcher(directories[] ...);
HitCollector myHC = new MyHitCollector(searcher, ...);
searcher.search(myQuery, myHC);
myHC.reportStatistics();


And myHC.collect(int docid, float rawscore) looks like this:

public void collect(int docid, float rawscore) {

  // Load the stored fields of the matching document.
  Document doc = searcher.doc(docid);

  String s1 = doc.get("field1");
  String s2 = doc.get("field2");
  String s3 = doc.get("field3");
  ...
  CumulateStatistics(s1, s2, s3, ...);

}

I know that IndexReader has the termPositions method, but I can't use
this approach as I need to do this from within the 'search' call.  Do you
have an idea how I could use it in my scheme?

Jp
-----Original Message-----
From: Mike Klaas [mailto:[hidden email]]
Sent: Friday, October 20, 2006 5:00 PM
To: [hidden email]
Subject: Re: "Catalog" backend for document stored fields?

On 10/20/06, Robichaud, Jean-Philippe
<[hidden email]> wrote:
> 3-       Any ideas on how else I could do this?  I'm fully open to
> discussion!

How about not storing the fields at all, but storing term vectors, and
reconstructing the data from termpositions + terminfo?

-Mike


Re: "Catalog" backend for document stored fields?

Doron Cohen
In reply to this post by Robichaud, Jean-Philippe-2
> I'm indexing logs from a transaction-based application.
> ...
> millions documents per month, the size of the indices is ~35 gigs per month
> (that's the lower bound).  I have no choice but to 'store' each field values
> (as well as indexing/tokenizing them) because I'll need to retrieve them in
> order to create various reports.  Also, I have a backlog of ~2 years of logs
> to index!
> ...
> 1- Is there someone out there that already wrote an extension to
> Lucene so that 'stored' string for each document/field is in fact stored in
> a centralized repository? Meaning, only an 'index' is actually stored in the
> document and the real data is put somewhere else.

Do you gain anything from storing the document fields within Lucene?  If
not, especially if the log files are kept somewhere, you could make all
'content' fields unstored (reducing index size) and add a stored, non-indexed
ID field. It can also be a POINTER field, e.g. <log file name + start
offset + length>.  At search time, for found documents you can retrieve
this ID/POINTER field and then fetch the document from the (original) log
file. Makes sense?
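
A small sketch of the POINTER idea above, in plain Java with no Lucene dependency. The class LogPointer, the "<file>:<offset>:<length>" string layout, and the assumption that log files are UTF-8 are all illustrative choices, not something specified in this thread. The string from toFieldValue() is what would go into the stored, non-indexed field; fetch() is what a report would call for each hit.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: locates one transaction record inside an original log file.
class LogPointer {
    final String file;   // path of the log file
    final long offset;   // byte offset where the record starts
    final int length;    // record length in bytes

    LogPointer(String file, long offset, int length) {
        this.file = file;
        this.offset = offset;
        this.length = length;
    }

    // Compact form suitable for a single stored field.
    String toFieldValue() {
        return file + ":" + offset + ":" + length;
    }

    // Parses from the right so that file names containing ':' still work.
    static LogPointer parse(String fieldValue) {
        int last = fieldValue.lastIndexOf(':');
        int mid = fieldValue.lastIndexOf(':', last - 1);
        return new LogPointer(
                fieldValue.substring(0, mid),
                Long.parseLong(fieldValue.substring(mid + 1, last)),
                Integer.parseInt(fieldValue.substring(last + 1)));
    }

    // Seeks to the record and reads exactly 'length' bytes (UTF-8 assumed).
    String fetch() throws IOException {
        byte[] buffer = new byte[length];
        try (RandomAccessFile log = new RandomAccessFile(file, "r")) {
            log.seek(offset);
            log.readFully(buffer);
        }
        return new String(buffer, StandardCharsets.UTF_8);
    }
}
```

The trade-off is one extra file seek per reported document, in exchange for an index that stores only a short pointer instead of 10 to 250 field values per document. It also only works as long as the original log files stay in place and unmodified.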

