Data structure of a Lucene Index

Data structure of a Lucene Index

Prasenjit Mukherjee-3
It seems to me that Lucene doesn't use a B-tree for its index storage.
Is there any paper or article that explains the theory behind the data
structure of a single index (segment)? I am not referring to the merge
algorithm; I am curious to know the storage structure of a single
optimized Lucene index.

Any pointer is greatly appreciated.
--Prasen

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Data structure of a Lucene Index

Erik Hatcher

On Mar 28, 2006, at 11:57 PM, Prasenjit Mukherjee wrote:

> It seems to me that lucene doesn't use B-tree for its indexing  
> storage. Any paper/article which explains the theory behind data-
> structure of  single index(segment).  I am not referring to the  
> merge algorithm, I am curious to know the storage structure of a  
> single optimized lucene index.
>
> Any pointer is greatly appreciated.

How about this for starters?

        http://lucene.apache.org/java/docs/fileformats.html
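
To give a rough feel for the kind of structure that document describes, here is a minimal sketch in Python rather than Lucene's own Java. It assumes only three ideas: a sorted term dictionary, per-term postings stored as gaps between document ids, and a variable-length integer encoding with 7 data bits per byte. The names build_segment, lookup, encode_vint and decode_vints are hypothetical and are not Lucene APIs; the real on-disk format has many more parts (frequencies, positions, skip data, stored fields).

```python
from bisect import bisect_left
from collections import defaultdict


def encode_vint(n):
    """Encode a non-negative int, 7 bits per byte, low-order bits first.
    A set high bit means that more bytes follow."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def decode_vints(data):
    """Decode a byte string holding concatenated variable-length ints."""
    values, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(current)
            current, shift = 0, 0
    return values


def build_segment(docs):
    """Return (sorted terms, parallel list of gap-encoded postings bytes)."""
    postings = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            postings[term].append(doc_id)  # doc ids arrive in increasing order
    terms = sorted(postings)
    encoded = []
    for term in terms:
        doc_ids = postings[term]
        # Store the first id as-is, then only the gap to the previous id.
        deltas = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
        encoded.append(b"".join(encode_vint(d) for d in deltas))
    return terms, encoded


def lookup(segment, term):
    """Binary-search the sorted terms, then decode gaps into absolute doc ids."""
    terms, encoded = segment
    i = bisect_left(terms, term)
    if i == len(terms) or terms[i] != term:
        return []
    doc_ids, total = [], 0
    for delta in decode_vints(encoded[i]):
        total += delta
        doc_ids.append(total)
    return doc_ids


segment = build_segment(["lucene index", "index merge", "lucene merge index"])
print(lookup(segment, "lucene"))
```

Because the terms and postings are written once in sorted order and never updated in place, such a segment can be read with sequential scans plus a binary search, which is one reason no B-tree is needed.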




Re: Data structure of a Lucene Index

Prasenjit Mukherjee-3
I have already gone through the file formats document. What I was looking
for is the underlying theory behind the chosen file formats. I am sure
those file formats were decided based on some theoretical axioms.

--prasen




Re: Data structure of a Lucene Index

Doug Cutting
In reply to this post by Prasenjit Mukherjee-3
I talked about this a bit in a presentation at Haifa last year:

http://www.haifa.ibm.com/Workshops/ir2005/papers/DougCutting-Haifa05.pdf

See the section on "Seek versus Transfer".

Doug
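
The seek-versus-transfer trade-off can be illustrated with a back-of-the-envelope calculation. The sketch below is not taken from the slides: the seek time, transfer rate, entry size and index size are all assumed round figures, and the function names are hypothetical. It compares updating entries in place (one random seek per update, as a B-tree larger than memory would roughly need) with rewriting the whole index sequentially (as a merge does).

```python
# Every constant below is an assumed, illustrative figure.
SEEK_TIME_S = 0.010         # assumed: 10 ms per random disk seek
TRANSFER_RATE_BPS = 100e6   # assumed: 100 MB/s sequential transfer
ENTRY_SIZE_B = 100          # assumed: 100-byte index entries


def in_place_update_seconds(num_updates, seeks_per_update=1):
    """Cost of updating entries in place: dominated by random seeks."""
    return num_updates * seeks_per_update * SEEK_TIME_S


def rewrite_seconds(total_entries):
    """Cost of rewriting everything sequentially: read it all, write it all."""
    total_bytes = total_entries * ENTRY_SIZE_B
    return 2 * total_bytes / TRANSFER_RATE_BPS


total_entries = 1_000_000_000      # assumed: 100 GB worth of entries
updates = total_entries // 100     # touch 1% of them

print(f"in-place: {in_place_update_seconds(updates) / 3600:.1f} h")
print(f"rewrite:  {rewrite_seconds(total_entries) / 3600:.1f} h")
```

Under these assumed figures, touching 1% of the entries in place costs about 100,000 seconds (roughly 28 hours) of seeking, while reading and rewriting the entire 100 GB sequentially costs about 2,000 seconds. The exact ratio depends entirely on the hardware numbers plugged in.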



Implemented subclasses of Similarity class in Lucene

Ganesh Ramakrishnan
Hi

Is anyone aware of subclasses of the Similarity class in Lucene? Two subclasses are DefaultSimilarity and SimilarityDelegator. Are any other implemented subclasses of Similarity, developed by anyone else, available on the web? For example, a language-model-based similarity, an Okapi BM similarity, or different TF-IDF weighting schemes.
 
  If so, can you point me to them?
 
  Thanks and regards,
  Ganesh.
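
For anyone comparing the weighting schemes mentioned above, here is a minimal per-term sketch in Python (not a Lucene Similarity subclass). The first function uses the tf and idf factors as I understand DefaultSimilarity to define them, tf = sqrt(freq) and idf = 1 + ln(numDocs / (docFreq + 1)), and leaves out length normalization, boosts and query normalization. The second is the textbook Okapi BM25 formula with the commonly used defaults k1 = 1.2 and b = 0.75. The function names and parameters are hypothetical.

```python
import math


def classic_tfidf(freq, doc_freq, num_docs):
    """Per-term weight from the classic tf and idf factors only."""
    tf = math.sqrt(freq)
    idf = 1.0 + math.log(num_docs / (doc_freq + 1))
    return tf * idf


def bm25(freq, doc_freq, num_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Per-term Okapi BM25 weight (textbook form, non-negative idf variant)."""
    idf = math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = 1.0 - b + b * doc_len / avg_doc_len
    return idf * freq * (k1 + 1.0) / (freq + k1 * length_norm)


# Compare how the two weights grow with term frequency.
for freq in (1, 4, 16, 64):
    print(freq,
          round(classic_tfidf(freq, doc_freq=10, num_docs=1000), 3),
          round(bm25(freq, doc_freq=10, num_docs=1000,
                     doc_len=100, avg_doc_len=100), 3))
```

The main behavioural difference is saturation: the square-root tf keeps growing with term frequency, while the BM25 term weight is bounded above by idf * (k1 + 1).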
                       

RE: Data structure of a Lucene Index

Dmitry Goldenberg
In reply to this post by Prasenjit Mukherjee-3
Ideally, I'd love to see an article explaining both in detail: the index structure as well as the merge algorithm...



Re: Data structure of a Lucene Index

Prasenjit Mukherjee-3
I think Doug's paper (specifically the "Seek versus Transfer" section) is
the closest I could get. A somewhat more detailed explanation can be found
in Yates' book on Information Retrieval. I agree with Dmitry: a detailed
explanation (or even pointers to some existing article) would be
beneficial to all of us.

--prasen



Re: Implemented subclasses of Similarity class in Lucene

Edgar Meij
In reply to this post by Ganesh Ramakrishnan
Hi Ganesh,

We have developed a Language Modeling extension to Lucene at the
University of Amsterdam. It can be found here:

http://ilps.science.uva.nl/Resources/#lm-lucen

It was built around Lucene 1.4.3, so it isn't source compatible with
the latest Lucene version. We are currently working on
integrating/updating it to Lucene 1.9.

Best,

Edgar Meij
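
For readers unfamiliar with language-model scoring in general, here is a minimal Python sketch of query-likelihood ranking with Jelinek-Mercer smoothing. It is not the ILPS extension and makes no claim about how that code works; the function name lm_score, the toy documents and the smoothing weight lam = 0.1 are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def lm_score(query_terms, doc_terms, collection_counts, collection_len, lam=0.1):
    """Log query likelihood with Jelinek-Mercer smoothing:
    P(t|d) = (1 - lam) * tf(t, d) / |d| + lam * cf(t) / |C|
    """
    doc_counts = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        p_doc = doc_counts[term] / len(doc_terms)
        p_coll = collection_counts[term] / collection_len
        p = (1.0 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            # Term never occurs in the collection: likelihood is zero.
            return float("-inf")
        score += math.log(p)
    return score


docs = [
    ["lucene", "index", "segment"],
    ["merge", "policy", "index"],
]
collection_counts = Counter(term for doc in docs for term in doc)
collection_len = sum(len(doc) for doc in docs)

query = ["lucene", "index"]
ranked = sorted(
    range(len(docs)),
    key=lambda i: lm_score(query, docs[i], collection_counts, collection_len),
    reverse=True,
)
print(ranked)
```

The smoothing term is what lets a document that lacks one query term still receive a finite score, which plays a role loosely comparable to idf in TF-IDF schemes.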




--
'An approximate answer to the right question is worth a great deal
more than a precise answer to the wrong question'


Re: Implemented subclasses of Similarity class in Lucene

Murat Yakici
Hi Edgar,
While doing the integration/updating for Lucene 1.9, could you be more
open and clear about the design so that people can
1) understand it, and
2) extend it?

Just a recommendation.

Cheers,
Murat




Re[2]: Implemented subclasses of Similarity class in Lucene

Charlie-24
Hi Edgar,

Are there any technical reports explaining your design and
implementation of LM on Lucene? Or which source files exactly make up
the "LM extension"?
--
Best regards,
 Charlie


---
Friday, May 26, 2006, 7:36:14 AM, you wrote:

> Hi Edgar,
> While doing the integration/updating for Lucene 1.9, could you be more
> open and clear about the design so that people can
> 1)Understand it,
> 2)Extend it,

> Just a recommendation.

> Cheers,
> Murat




