Lucene Planning


Lucene Planning

Grant Ingersoll
I have added http://wiki.apache.org/jakarta-lucene/LucenePlanning to the
Wiki.  Currently there are two items of interest in it.  A start of some
documentation related to a Java 1.5 migration and a start of some
documentation concerning how to add more flexible indexing options and
how to store metadata at the index level.  The former conversation was
started by Karl on the developer's list and the latter was kicked off by
an email from me to Doug on how to implement #11 of
http://wiki.apache.org/jakarta-lucene/Lucene2Whiteboard

The pages are not meant to replace good discussions on the mailing list,
just to capture the group's consensus on how to move forward with the
big issues facing Lucene.

--

Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
School of Information Studies
335 Hinds Hall
Syracuse, NY 13244

http://www.cnlp.org 
Voice:  315-443-5484
Fax: 315-443-6886


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Lucene Planning

Nadav Har'El-2
Grant Ingersoll <[hidden email]> wrote on 28/05/2006 06:22:06 PM:

> I have added http://wiki.apache.org/jakarta-lucene/LucenePlanning to the
> Wiki.  Currently there are two items of interest in it.  A start of some
> documentation related to a Java 1.5 migration and a start of some
> documentation concerning how to add more flexible indexing options and
> how to store metadata at the index level.  The former conversation was
> started by Karl on the developer's list and the latter was kicked off by
> an email from me to Doug on how to implement #11 of
> http://wiki.apache.org/jakarta-lucene/Lucene2Whiteboard

I think the suggestion for position-specific boost is not enough,
and what is really needed is a more general "payload" mechanism,
that allows storing with each position a variable length payload
(byte[]) which the application can use for its purposes. Such payloads
are essential for many applications - including XML search, faceted
search (if you don't want to cache stuff in memory, like people
suggested on a thread from last week), fast numeric search, and more.

Adding payloads is actually not difficult, but would require a change
to the index file format (probably the positions file) and some
changes to the basic indexing API (such as a new Field constructor
with a payload, adding payloads to tokens coming out of an analyzer,
and getting payloads from a TermPositions), so we better do this
after a bit of thought, and do it now - when it's natural to
start thinking about changes to the index file format.
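
As a rough illustration of the payload idea, here is a minimal in-memory sketch of a posting whose positions each carry an opaque byte[]. Everything in it is an assumption made for illustration: PayloadPosting and its methods are hypothetical names, not Lucene API, and the sketch says nothing about how payloads would be laid out in the positions file.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: one document's occurrences of a term, where every
// position carries an opaque, variable-length payload. Not Lucene API.
final class PayloadPosting {
    private final int doc;
    private final List<Integer> positions = new ArrayList<>();
    private final List<byte[]> payloads = new ArrayList<>();

    PayloadPosting(int doc) {
        this.doc = doc;
    }

    // Record one occurrence; a null payload is stored as an empty array.
    void add(int position, byte[] payload) {
        positions.add(position);
        payloads.add(payload == null ? new byte[0] : payload.clone());
    }

    int doc() { return doc; }

    // Number of occurrences of the term in this document.
    int freq() { return positions.size(); }

    int position(int i) { return positions.get(i); }

    // Returns a copy so callers cannot mutate the stored payload.
    byte[] payload(int i) { return payloads.get(i).clone(); }
}
```

The application alone decides what the bytes mean (an XML element id, a facet value, an encoded number); the index would only store and return them.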

Another, related, improvement, I think, should be to make positions
optional for certain fields. For some fields, positions are useless
because phrase search will never be used. For example, a field that
keeps a list of "categories" that a document is in. A document can
either be, or not be, in a category, but there is no significance
in the order of these categories in a document's list.


--
Nadav Har'El



Flexible Indexing (was Re: Lucene Planning)

Marvin Humphrey

On May 31, 2006, at 6:53 AM, Nadav Har'El wrote:

> I think the suggestion for position-specific boost is not enough,
> and what is really needed is a more general "payload" mechanism,
> that allows storing with each position a variable length payload
> (byte[]) which the application can use for its purposes.

Would the payload be inserted per-termdoc or per-posting (i.e. per-
position)?

One possible application of this scheme is order-by-date: stuff a  
numeric representation of a date into each termdoc.  That would  
consume an awful lot of index space, but it would make returning
documents within a range of dates very fast.

Another possibility was raised by Grant in
<http://wiki.apache.org/jakarta-lucene/ConversationsBetweenDougMarvinAndGrant>:
storing part-of-speech along with position.

Doug described arbitrary extensibility via a per-Field codec here:
<http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200605.mbox/%[hidden email]%3E>.
I have to say, I find the idea of a pluggable posting format enticing.

> Adding payloads is actually not difficult, but would require a change
> to the index file format (probably the positions file)

In my view, the positions file, the freqs file, and the norms should  
all be merged into one, a la Google98[1].  However, an interesting  
wrinkle here is, if positions are optional, at what point in the term-
dictionary do you start applying the new decoder?  Perhaps we'd need  
one postings file per indexed field, or one file per codec.

> Another, related, improvement, I think, should be to make positions
> optional for certain fields.

Why stop there?  Norms/boosts are currently optional.  Why not make  
freqs optional as well?  That's the current state of the proposal at  
<http://wiki.apache.org/jakarta-lucene/FlexibleIndexing>.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

[1] Brin/Page: "The Anatomy of a Large-Scale Hypertextual Web Search  
Engine" <http://dbpubs.stanford.edu:8090/pub/1998-8>.



Re: Lucene Planning

Doug Cutting
In reply to this post by Nadav Har'El-2
Nadav Har'El wrote:
> Grant Ingersoll <[hidden email]> wrote on 28/05/2006 06:22:06 PM:
>
>>I have added http://wiki.apache.org/jakarta-lucene/LucenePlanning
>
> I think the suggestion for position-specific boost is not enough,
> and what is really needed is a more general "payload" mechanism,
> that allows storing with each position a variable length payload
> (byte[]) which the application can use for its purposes.

+1

I too have seen applications where this is required.

> we better do this
> after a bit of thought, and do it now - when it's natural to
> start thinking about changes to the index file format.

That's precisely what this wiki page is for: planning these changes to
the API and index file format.

> Another, related, improvement, I think, should be to make positions
> optional for certain fields.

That's the intent.  On

   http://wiki.apache.org/jakarta-lucene/FlexibleIndexing

the [a,b,c,d] options are meant to be some different forms that postings
for a term could take, and the [1,2,3,4] are meant to be the flags that
one sets on a field to control the format of postings.  Logically there
are thus 16 possible formats, but many of these don't make sense and
would thus not be implemented.
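
The arithmetic behind "16 possible formats" can be sketched as follows. The four flag names are assumptions chosen for illustration (the wiki's actual list is not reproduced in this thread), and the two sanity rules in makesSense are likewise assumed, not taken from the proposal.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;

// Hypothetical per-field flags; the names are assumptions for illustration.
enum PostingFlag { DOC_BOOST, FREQ, POSITIONS, POSITION_BOOST }

final class PostingFormats {
    // Enumerate every subset of the four flags: 2^4 = 16 combinations.
    static List<EnumSet<PostingFlag>> allCombinations() {
        PostingFlag[] flags = PostingFlag.values();
        List<EnumSet<PostingFlag>> result = new ArrayList<>();
        for (int mask = 0; mask < (1 << flags.length); mask++) {
            EnumSet<PostingFlag> set = EnumSet.noneOf(PostingFlag.class);
            for (int i = 0; i < flags.length; i++) {
                if ((mask & (1 << i)) != 0) {
                    set.add(flags[i]);
                }
            }
            result.add(set);
        }
        return result;
    }

    // Assumed sanity rules: a position boost needs positions, and
    // positions need a freq so the reader knows how many to expect.
    static boolean makesSense(EnumSet<PostingFlag> flags) {
        if (flags.contains(PostingFlag.POSITION_BOOST)
                && !flags.contains(PostingFlag.POSITIONS)) {
            return false;
        }
        if (flags.contains(PostingFlag.POSITIONS)
                && !flags.contains(PostingFlag.FREQ)) {
            return false;
        }
        return true;
    }
}
```

Under these two assumed rules, 8 of the 16 combinations survive; a different rule set would prune the space differently.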

I think what you're proposing above is that we replace the positionBoost
flag with a positionPayload flag.  Does that sound right?

Doug


Re: Flexible Indexing (was Re: Lucene Planning)

Marvin Humphrey
In reply to this post by Marvin Humphrey
[wild brainstorming...]

Another reason to consolidate the freqs, positions, and boosts/norms  
into one file: we can isolate and distill the code that encodes/
decodes that file into a plugin, weakening the current tight coupling  
between Lucene and its file format.  Changing that index format might  
then be a little less painful, as we'd just write a new plugin but  
leave the old one sitting there.  We may not be able to write plugin  
code for an entire index, but we can write some for each file.

I'm imagining a PostingsWriter interface that each plugin would  
implement, then a complementary PostingsReader.  PostingsReader would  
look a lot like TermPositions does now, but would add getBoost().  To  
this, a POSPostingsReader subclass might add getPartOfSpeech().
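
A minimal sketch of how those interfaces might look. All of it is hypothetical: only the names PostingsWriter, PostingsReader, POSPostingsReader, getBoost() and getPartOfSpeech() come from the paragraph above; every other method name, and the in-memory SingleDocPOSReader used to exercise the contract, is an assumption. The part-of-speech variant is modeled as an extending interface rather than a subclass.

```java
// Hypothetical interfaces; none of these types exist in Lucene.
interface PostingsReader {
    boolean next();       // advance to the next document, false when exhausted
    int doc();            // current document number
    int freq();           // occurrences of the term in the current document
    int nextPosition();   // next position within the current document
    float getBoost();     // boost of the position just returned
}

// A part-of-speech aware reader adds one accessor to the base contract.
interface POSPostingsReader extends PostingsReader {
    String getPartOfSpeech();   // tag of the position just returned
}

// Writer side of the same plugin; method names are assumptions.
interface PostingsWriter {
    void startDoc(int doc);
    void addPosition(int position, float boost);
}

// Toy in-memory reader over a single document, for demonstration only.
final class SingleDocPOSReader implements POSPostingsReader {
    private final int doc;
    private final int[] positions;
    private final float[] boosts;
    private final String[] tags;
    private boolean consumed = false;
    private int cursor = -1;

    SingleDocPOSReader(int doc, int[] positions, float[] boosts, String[] tags) {
        this.doc = doc;
        this.positions = positions;
        this.boosts = boosts;
        this.tags = tags;
    }

    public boolean next() {
        if (consumed) {
            return false;
        }
        consumed = true;
        cursor = -1;
        return true;
    }

    public int doc() { return doc; }
    public int freq() { return positions.length; }
    public int nextPosition() { return positions[++cursor]; }
    public float getBoost() { return boosts[cursor]; }
    public String getPartOfSpeech() { return tags[cursor]; }
}
```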

In addition to the postings file, we might want a stored fields file  
plugin.  Maybe call those interfaces DBWriter and DBReader.  This is  
trickier, because stored fields are not inverted, so if we used  
different codecs for each field, their output would have to be  
interleaved.  Bleah.  Seems more like we'd want to use a plugin for  
the entire file, with a limited selection of per-field options.

Each segment would have a file recording which codecs were in use.  
Each field name, once associated with a codec, could not be modified  
to use another.  No more reconciliation of indexed/notIndexed,  
omitNorms/notOmitNorms.
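
The "bind once, never change" rule could be sketched like this. FieldCodecRegistry is a hypothetical name, codecs are identified by plain strings purely for illustration, and nothing here describes how the per-segment file would actually be written.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical per-segment record of which codec each field uses.
final class FieldCodecRegistry {
    private final Map<String, String> codecByField = new TreeMap<>();

    // Bind a field to a codec. Re-binding to the same codec is a no-op;
    // binding to a different one is rejected.
    void bind(String field, String codecName) {
        String existing = codecByField.putIfAbsent(field, codecName);
        if (existing != null && !existing.equals(codecName)) {
            throw new IllegalStateException(
                "field '" + field + "' already uses codec '" + existing + "'");
        }
    }

    // Returns the codec bound to the field, or null if none has been bound.
    String codecFor(String field) {
        return codecByField.get(field);
    }
}
```

Rejecting the second binding outright is what removes the need to reconcile conflicting per-document settings at merge time.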

Does it make sense then to have the Term Dictionary as a plugin?  I  
think so.  But maybe rather than ordering all terms first by field  
name then by term text, each indexed field should have its own  
dictionary file, ordered by term text.  Then the dictionary file  
could have per-field customization as well.

The point of this exercise is to generalize the high level data  
structures required by an inverted indexing engine.

   * Term Dictionary
   * Postings
   * Stored Fields Database
   * Term Vectors (optional)

In my view, each of these should have its own pluggable codec.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



Re: Flexible Indexing (was Re: Lucene Planning)

Grant Ingersoll


Marvin Humphrey wrote:
>   * Term Vectors (optional)

Someone on the list a while ago suggested moving Term Vectors out of the
postings and storing them separately, as then they don't have to be
merged (but the doc ids would have to be kept up to date).

--

Grant Ingersoll



Re: Flexible Indexing (was Re: Lucene Planning)

Marvin Humphrey

On Jun 1, 2006, at 5:48 AM, Grant Ingersoll wrote:

> Someone on the list a while ago suggested moving Term Vectors out  
> of the postings and storing them separately, as then they don't  
> have to be merged (but the doc ids would have to be kept up to date)

Yes, that was me.  :)  I suggested storing  TermVector data alongside  
stored field data, in the .fdt file.  That's what KinoSearch does  
right now.  It cuts down on disk seeks.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



Re: Flexible Indexing (was Re: Lucene Planning)

Grant Ingersoll
I thought it was you, but wasn't sure.

I would also like a way to store the frequency of the term in the
overall collection (probably should go in the Term dictionary, but not
sure, at the cost of an additional VInt per term, but I am open to other
places to store it).  Right now, in order to calculate this, one has to
either store it separately at indexing time (using a term counting
Filter) or calculate it at runtime by looping over the TermDocs and
summing.
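
For reference, the runtime workaround amounts to the loop below. To keep the sketch compilable on its own, DocFreqCursor is a locally defined stand-in that mirrors the next()/freq() pair of Lucene's TermDocs; the interface name and the over() helper are assumptions, not Lucene API.

```java
// Stand-in for a per-term document iterator. It mirrors the next()/freq()
// pair of Lucene's TermDocs but is defined here so the sketch stands alone.
interface DocFreqCursor {
    boolean next();   // advance to the next document containing the term
    int freq();       // frequency of the term in the current document
}

final class CollectionFrequency {
    // Sum the within-document frequencies over every document for one term.
    static long sum(DocFreqCursor cursor) {
        long total = 0;
        while (cursor.next()) {
            total += cursor.freq();
        }
        return total;
    }

    // Build a cursor over fixed per-document frequencies, for demonstration.
    static DocFreqCursor over(int... freqs) {
        return new DocFreqCursor() {
            private int i = -1;
            public boolean next() { return ++i < freqs.length; }
            public int freq() { return freqs[i]; }
        };
    }
}
```

The cost is one pass over the term's whole postings list per lookup, which is what a stored per-term count would avoid.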

Marvin Humphrey wrote:

>
> On Jun 1, 2006, at 5:48 AM, Grant Ingersoll wrote:
>
>> Someone on the list a while ago suggested moving Term Vectors out of
>> the postings and storing them separately, as then they don't have to
>> be merged (but the doc ids would have to be kept up to date)
>
> Yes, that was me.  :)  I suggested storing  TermVector data alongside
> stored field data, in the .fdt file.  That's what KinoSearch does
> right now.  It cuts down on disk seeks.

--

Grant Ingersoll



Re: Flexible Indexing (was Re: Lucene Planning)

Marvin Humphrey

On Jun 2, 2006, at 6:48 AM, Grant Ingersoll wrote:
> I thought it was you, but wasn't sure.

I'm always looking for ways to minimize Term Vectors, because I  
consider excerpting/highlighting a core feature rather than an add-
on, and they seem like such overkill.  It bothers me that they  
duplicate so much information.

I've been toying with the idea of a hitCollector.collect(int docNum,  
float score, ScorePositions[] scorePositions) method -- or, more  
likely, a hitCollector.collect(Scorer scorer) method -- that would  
preserve each position that contributed to the score of a document  
and how much it contributed, allowing that information to be passed  
through a Hit object to the Highlighter.
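
A rough sketch of the first variant, collect(int, float, ScorePositions[]). Every type here is hypothetical: ScorePosition stands in for the ScorePositions element type named above, and PositionAwareCollector is not related to Lucene's own HitCollector class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical: one matching position and its share of the document's score.
final class ScorePosition {
    final int position;
    final float contribution;

    ScorePosition(int position, float contribution) {
        this.position = position;
        this.contribution = contribution;
    }
}

// Hypothetical collector that remembers which positions produced each hit,
// so a highlighter could use them later without re-analyzing the text.
final class PositionAwareCollector {
    private final Map<Integer, Float> scoreByDoc = new LinkedHashMap<>();
    private final Map<Integer, List<ScorePosition>> positionsByDoc = new LinkedHashMap<>();

    void collect(int docNum, float score, ScorePosition[] scorePositions) {
        scoreByDoc.put(docNum, score);
        positionsByDoc.put(docNum, new ArrayList<>(Arrays.asList(scorePositions)));
    }

    // Score recorded for the document, or 0 if it was never collected.
    float scoreFor(int docNum) {
        return scoreByDoc.getOrDefault(docNum, 0f);
    }

    List<ScorePosition> positionsFor(int docNum) {
        return positionsByDoc.getOrDefault(docNum, Collections.<ScorePosition>emptyList());
    }

    // The position that contributed most to the score, or null if none.
    ScorePosition strongest(int docNum) {
        ScorePosition best = null;
        for (ScorePosition sp : positionsFor(docNum)) {
            if (best == null || sp.contribution > best.contribution) {
                best = sp;
            }
        }
        return best;
    }
}
```

A highlighter could centre its excerpt on strongest(docNum), provided it can map that position back to character offsets.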

That might be complemented by storing the startOffsets and endOffsets
for each field as streams of delta-encoded VInts along with the  
stored field data.  Conceptually, it would be even cleaner to keep  
startOffsets and endOffsets in the postings...

a. <doc>+

b. <doc, boost>+

c. <doc, freq, <position>+ >+

d. <doc, freq, <position, boost>+ >+

e. <doc, freq, <position, boost, startOffset, endOffset>+ >+

... and pass *everything* the Highlighter needs to the Hit object.  
However, the offsets are never needed for scoring.
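
The "streams of delta-encoded VInts" mentioned above can be sketched as follows. The VInt layout is the usual Lucene one (seven bits per byte, low-order bits first, high bit set when another byte follows); the class and method names, and the use of java.io byte-array streams in place of Lucene's IndexInput/IndexOutput, are assumptions made so the sketch stands alone.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

final class DeltaVInt {
    // Write a non-negative int, seven bits per byte, low-order bits first;
    // the high bit of each byte signals that another byte follows.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    static int readVInt(ByteArrayInputStream in) {
        int value = 0;
        for (int shift = 0; ; shift += 7) {
            int b = in.read();
            if (b < 0) {
                throw new IllegalStateException("truncated VInt");
            }
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return value;
            }
        }
    }

    // Encode ascending offsets as the gap from the previous offset.
    static byte[] encode(int[] ascending) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int offset : ascending) {
            writeVInt(out, offset - previous);
            previous = offset;
        }
        return out.toByteArray();
    }

    // Rebuild the absolute offsets by accumulating the gaps.
    static int[] decode(byte[] bytes, int count) {
        ByteArrayInputStream in = new ByteArrayInputStream(bytes);
        int[] result = new int[count];
        int previous = 0;
        for (int i = 0; i < count; i++) {
            previous += readVInt(in);
            result[i] = previous;
        }
        return result;
    }
}
```

Because offsets within a field only grow, the gaps stay small and most of them fit in a single byte.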

> I would also like a way to store the frequency of the term in the  
> overall collection (probably should go in the Term dictionary, but  
> not sure, at the cost of an additional VInt per term, but I am open  
> to other places to store it).  Right now, in order to calculate  
> this, one has to either store it separately at indexing time (using  
> a term counting Filter) or calculate it at runtime by looping over  
> the TermDocs and summing.

Sure, makes sense to me.  Sounds like a custom codec you'd define.  
(The following code has been swiped and adapted from TermBuffer...)

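import java.io.IOException;
import org.apache.lucene.index.FieldInfos;
import org.apache.lucene.store.IndexInput;

public class CollFreqCodec extends TermDictionaryCodec {
   private int collFreq;   // frequency of the term across the whole collection

   // Reads one prefix-compressed term record, then the extra collFreq VInt.
   // term, bytes, field and setBytesLength() are assumed to be inherited
   // from the (hypothetical) TermDictionaryCodec base class.
   public void readRecord (IndexInput input, FieldInfos fieldInfos)
     throws IOException {
     this.term = null;                           // invalidate cache
     int start = input.readVInt();               // prefix shared with previous term
     int length = input.readVInt();              // length of the new suffix
     int totalLength = start + length;
     setBytesLength(totalLength);
     input.readBytes(this.bytes, start, length);
     this.field = fieldInfos.fieldName(input.readVInt());
     this.collFreq = input.readVInt();
   }
}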

That's not quite right, because I'm envisioning a codec rather than a  
TermBuffer subclass, but maybe you get the idea.
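
For anyone who wants to experiment with the record layout outside Lucene, here is a self-contained round trip of the same idea. It is a sketch under stated assumptions: CollFreqRecordDemo is a hypothetical name, java.io data streams replace IndexInput, fixed-width ints replace VInts to keep it short, and the field is kept as a number rather than resolved through FieldInfos.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical stand-in for the record layout sketched above.
final class CollFreqRecordDemo {
    byte[] bytes = new byte[0];   // term text, reused between records
    int fieldNumber;
    int collFreq;

    // Write one record: shared-prefix length, suffix length, suffix bytes,
    // field number, then the collection-wide frequency.
    static void write(DataOutputStream out, byte[] previous, byte[] term,
                      int fieldNumber, int collFreq) throws IOException {
        int start = 0;
        int limit = Math.min(previous.length, term.length);
        while (start < limit && previous[start] == term[start]) {
            start++;
        }
        out.writeInt(start);
        out.writeInt(term.length - start);
        out.write(term, start, term.length - start);
        out.writeInt(fieldNumber);
        out.writeInt(collFreq);
    }

    // Read one record, keeping the first `start` bytes of the previous term.
    void read(DataInputStream in) throws IOException {
        int start = in.readInt();
        int length = in.readInt();
        bytes = Arrays.copyOf(bytes, start + length);
        in.readFully(bytes, start, length);
        fieldNumber = in.readInt();
        collFreq = in.readInt();
    }

    String text() {
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

Writing "apple" and then "apply" stores only the final byte of the second term, since the first four bytes are shared with its predecessor.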

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

