Payloads

Payloads

Michael Busch
Hi all,

currently it is not possible to add generic payloads to a posting list.
However, this feature would be useful for various use cases. Some examples:
- XML search
  To index XML documents and allow structured search (e.g. XPath), it is
necessary to store the depths of the terms.
- part-of-speech
  payloads can be used to store the part of speech of a term occurrence
- term boost
  for terms that occur e.g. in bold font a payload containing a boost
value can be stored
- ...

The payloads feature has been requested and discussed a couple of times,
e.g. in
- http://www.gossamer-threads.com/lists/lucene/java-dev/29465
- http://www.gossamer-threads.com/lists/lucene/java-dev/37409

In the latter thread I proposed a design a couple of months ago that
makes it possible for Lucene to store variable-length payloads inline
in the posting list of a term. However, this design had some drawbacks:
the already complex field API was extended, and the payloads encoding was
not optimal in terms of disk space. Furthermore, overall Lucene
runtime performance suffered due to the growth of the .prx file. In the
meantime the patch LUCENE-687 (Lazy skipping on proximity file) was
committed, which reduces the number of reads and seeks on the .prx file.
This minimizes the performance degradation of a bigger .prx file. Also,
LUCENE-695 (Improve BufferedIndexInput.readBytes() performance) was
committed, which speeds up reading mid-size chunks of bytes and is
beneficial for payloads that are bigger than just a few bytes.

Some weeks ago I started working on an improved design which I would
like to propose now. The new design simplifies the API extensions (the
Field API remains unchanged) and uses less disk space in most use cases.
Now there are only two classes that get new methods:
- Token.setPayload()
  Use this method to add arbitrary metadata to a Token in the form of a
byte[] array.
 
- TermPositions.getPayload()
  Use this method to retrieve the payload of a term occurrence.
 
The implementation is very flexible: the user does not have to enable
payloads explicitly for a field and can add payloads to all, some, or no
Tokens. Due to the improved encoding, those use cases are handled
efficiently in terms of disk space.
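For illustration, here is a minimal, self-contained sketch of the term-boost use case from the list above. The encoding helpers are my own invention, not part of the proposed API: they only show how a per-occurrence boost could be packed into the byte[] that Token.setPayload() would accept and TermPositions.getPayload() would return.

```java
// Sketch only: encode a per-occurrence boost as a 4-byte payload.
// The helper names here are hypothetical, not part of the patch.
class BoostPayload {

    // Encode a float boost into a 4-byte payload (big-endian IEEE 754).
    static byte[] encode(float boost) {
        int bits = Float.floatToIntBits(boost);
        return new byte[] {
            (byte) (bits >>> 24), (byte) (bits >>> 16),
            (byte) (bits >>> 8),  (byte) bits
        };
    }

    // Decode the payload back into the boost at search time.
    static float decode(byte[] payload) {
        int bits = ((payload[0] & 0xFF) << 24) | ((payload[1] & 0xFF) << 16)
                 | ((payload[2] & 0xFF) << 8)  |  (payload[3] & 0xFF);
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        byte[] payload = encode(2.5f);        // token.setPayload(payload);
        System.out.println(decode(payload));  // prints 2.5
    }
}
```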

Another thing I would like to point out is that this feature is
backwards compatible, meaning that the file format only changes if the
user explicitly adds payloads to the index. If no payloads are used, all
data structures remain unchanged.

I'm going to open a new JIRA issue soon containing the patch and details
about implementation and file format changes.

One more comment: it is a rather big patch and this is the initial
version, so I'm sure there will be a lot of discussion. I would like to
encourage people who consider this feature useful to try it out and
give me feedback about possible improvements.

Best regards,
- Michael


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Payloads

Grant Ingersoll-4
Hi Michael,

Have a look at https://issues.apache.org/jira/browse/LUCENE-662

I am planning on starting on this soon (I know, I have been saying  
that for a while, but I really am.)  At any rate, another set of eyes  
would be good and I would be interested in hearing how your version  
compares/works with this patch from Nicolas.

-Grant

On Dec 20, 2006, at 9:19 AM, Michael Busch wrote:

> [...]

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/





Re: Payloads

Nicolas Lalevée-2
On Wednesday, December 20, 2006, at 15:31, Grant Ingersoll wrote:
> Hi Michael,
>
> Have a look at https://issues.apache.org/jira/browse/LUCENE-662
>
> I am planning on starting on this soon (I know, I have been saying
> that for a while, but I really am.)  At any rate, another set of eyes
> would be good and I would be interested in hearing how your version
> compares/works with this patch from Nicolas.

In fact the work I have done is more about the storing part of Lucene than the
indexing part. But I think that the mechanism of defining an "IndexFormat" in
Java, which I introduced in my patch, will be useful in defining how payloads
should be read and written.

About my patch, it needs to be synchronized with the current trunk. I will
update it soon. It just needs some cleanup.

Nicolas


--
Nicolas LALEVÉE
Solutions & Technologies
ANYWARE TECHNOLOGIES
Tel : +33 (0)5 61 00 52 90
Fax : +33 (0)5 61 00 51 46
http://www.anyware-tech.com



Re: Payloads

Doug Cutting
In reply to this post by Michael Busch
Michael Busch wrote:
 > Some weeks ago I started working on an improved design which I would
 > like to propose now. The new design simplifies the API extensions (the
 > Field API remains unchanged) and uses less disk space in most use cases.
 > Now there are only two classes that get new methods:
 > - Token.setPayload()
 >  Use this method to add arbitrary metadata to a Token in the form of a
 > byte[] array.
 >
 > - TermPositions.getPayload()
 >  Use this method to retrieve the payload of a term occurrence.

Michael,

This sounds like very good work.  The back-compatibility of this
approach is great.  But we should also consider this in the broader
context of index-format flexibility.

Three general approaches have been proposed.  They are not exclusive.

1. Make the index format extensible by adding user-implementable reader
and writer interfaces for postings.

2. Add a richer set of standard index formats, including things like
compressed fields, no-positions, per-position weights, etc.

3. Provide hooks for including arbitrary binary data.

Your proposal is of type (3).  LUCENE-662 is a (1).  Approaches of type
(2) are most friendly to non-Java implementations, since the semantics
of each variation are well-defined.

I don't see a reason not to pursue all three, but in a coordinated
manner.  In particular, we don't want to add a feature of type (3) that
would make it harder to add type (1) APIs.  It would thus be best if we
had a rough specification of type (1) and type (2).  A proposal of type
(2) is at:

http://wiki.apache.org/jakarta-lucene/FlexibleIndexing

But I'm not sure that we yet have any proposed designs for an extensible
posting API.  (Is anyone aware of one?)  This payload proposal can
probably be easily incorporated into such a design, but I would have
more confidence if we had one.  I guess I should attempt one!

Here's a very rough, sketchy, first draft of a type (1) proposal.

IndexWriter#setPostingFormat(PostingFormat)
IndexWriter#setDictionaryFormat(DictionaryFormat)

interface PostingFormat {
   PostingInverter getInverter(FieldInfo, Segment, Directory);
   PostingReader getReader(FieldInfo, Segment, Directory);
   PostingWriter getWriter(FieldInfo, Segment, Directory);
}

interface PostingPointer {} ???

interface DictionaryFormat {
   DictionaryWriter getWriter(FieldInfo, Segment, Directory);
   DictionaryReader getReader(FieldInfo, Segment, Directory);
}

IndexWriter#addDocument(Document doc)
   loop over doc.fields
     call PostingFormat#getPostingInverter(FieldInfo, Segment, Directory)
       to create a PostingInverter
     if field is analyzed
       call Analyzer#tokenStream() to get TokenStream
       loop over tokens
         PostingInverter#collectToken(Token, Field);
     else
       PostingInverter#collectToken(Field);

   call DictionaryFormat#getWriter(FieldInfo, Segment, Directory)
     to create a DictionaryWriter
   Iterator<Term> terms = PostingInverter#getTerms();
   loop over terms
     PostingPointer p = PostingInverter#getPointer();
     PostingInverter#write(term);
     DictionaryWriter#addTerm(term, p);

IndexMerger#mergePostings()
   call DictionaryFormat#getReader(FieldInfo, Segment, Directory)
     to create a DictionaryReader
   loop over fields
     call PostingFormat#getWriter(FieldInfo, Segment, Directory)
       to create a PostingWriter
     loop over segments
       call PostingFormat#getReader(FieldInfo, Segment, Directory)
         to create a PostingReader
       loop over dictionary.terms
         PostingPointer p = PostingWriter#getPointer();
         DictionaryWriter#addTerm(Term, p);
         loop over docs
           int doc = PostingReader#readPostings();
           PostingWriter#writePostings(doc);

So the question is, does something like this conflict with your
proposal?  Should Term and/or Token be extensible?  If so, what should
their interfaces look like?

Doug



Re: Payloads

Michael Busch
In reply to this post by Nicolas Lalevée-2
Nicolas Lalevée wrote:

> On Wednesday, December 20, 2006, at 15:31, Grant Ingersoll wrote:
>> [...]
>
> In fact the work I have done is more about the storing part of Lucene than the
> indexing part. But I think that the mechanism of defining an "IndexFormat" in
> Java, which I introduced in my patch, will be useful in defining how payloads
> should be read and written.
>
> About my patch, it needs to be synchronized with the current trunk. I will
> update it soon. It just needs some cleanup.
>
> Nicolas
>
>  

That's right, Nicolas' patch makes the Lucene *store* more flexible,
whereas my payloads patch extends the *index* data structures.

Nicolas, I'm aware of your patch but haven't looked at it completely
yet. I think it would be great if our patches worked together. And with
Doug's suggestions (see his response) we would be on the right track to
the flexible indexing format! I would love to work together with you to
achieve this goal. I will look at your patch more closely in the next
few days.

- Michael





Re: Payloads

Michael Busch
In reply to this post by Doug Cutting
Doug Cutting wrote:

> Michael,
>
> This sounds like very good work.  The back-compatibility of this
> approach is great.  But we should also consider this in the broader
> context of index-format flexibility.
>
> [...]

Doug,

thanks for your detailed response. I'm aware that the long-term goal is
the flexible index format, and I see the payloads patch only as a part of
it. The patch focuses on extending the index data structures and on a
possible payload encoding. It doesn't yet focus on a flexible API; it
only offers the two mentioned low-level methods to add and retrieve byte
arrays.

I would love to work with you guys on the flexible index format and to
combine my patch with your suggestions and the patch from Nicolas! I
will look at your proposal and Nicolas' patch tomorrow (have to go now).
I just attached my patch (LUCENE-755), so if you get a chance you could
take a look at it.

Maybe it would make sense now to follow the suggestion you made earlier
this year and start a new package to work on the new index format? On
the other hand, if people would like to use payloads soon, then given
the backwards compatibility it would be low risk to add this feature to
the current index format until we can finish the flexible format.

- Michael




Re: Payloads

Doug Cutting
Michael Busch wrote:
> the other hand, if people would like to use the payloads soon I guess
> due to the backwards compatibility it would be low risk to add it to the
> current index format to provide this feature until we can finish the
> flexible format?

A reason not to commit something like this now would be if it
complicates the effort to make the format extensible.  Each index
feature we add now will require back-compatibility in the future, and we
should be hesitant to add features that might be difficult to support in
the future.

For example, this modifies the Token API.  If, long-term, we think that
Token should be extensible, then perhaps we should make it extensible
now, and add this through a subclass of Token (perhaps a mixin interface
that Tokens can implement).

I like the Payload feature, and think it should probably be added.  I
just want to make sure that we've first thought a bit about its
future-compatibility.

Doug



Re: Payloads

Michael Busch
Doug Cutting wrote:
>
> A reason not to commit something like this now would be if it
> complicates the effort to make the format extensible.  Each index
> feature we add now will require back-compatibility in the future, and
> we should be hesitant to add features that might be difficult to
> support in the future.
Yes, I agree.

I had the idea of defining Payload as an interface:

public interface Payload {
    void serialize(IndexOutput out) throws IOException;
    int serializedLength();
    void deserialize(IndexInput in, int length) throws IOException;
}

and to have a default implementation ByteArrayPayload that works like my
current patch. Then people could write their own implementation of
Payload and define how to serialize the content.
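A default implementation along these lines might look as follows. This is a sketch of my own, not code from the patch: java.io's DataOutput/DataInput stand in for Lucene's IndexOutput/IndexInput, and the length is assumed to be written by the posting writer rather than by the payload itself.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch: DataOutput/DataInput stand in for IndexOutput/IndexInput.
interface Payload {
    void serialize(DataOutput out) throws IOException;
    int serializedLength();
    void deserialize(DataInput in, int length) throws IOException;
}

// A guess at what the default ByteArrayPayload could look like.
class ByteArrayPayload implements Payload {
    private byte[] data;

    ByteArrayPayload() {}
    ByteArrayPayload(byte[] data) { this.data = data; }

    public void serialize(DataOutput out) throws IOException {
        out.write(data);  // the length is written by the posting writer
    }

    public int serializedLength() {
        return data.length;
    }

    public void deserialize(DataInput in, int length) throws IOException {
        data = new byte[length];
        in.readFully(data);
    }

    byte[] getData() { return data; }

    // Round-trip demo: serialize to a byte stream and read it back.
    static byte[] roundTrip(byte[] original) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            new ByteArrayPayload(original).serialize(new DataOutputStream(bos));
            ByteArrayPayload back = new ByteArrayPayload();
            back.deserialize(
                new DataInputStream(new ByteArrayInputStream(bos.toByteArray())),
                original.length);
            return back.getData();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```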
>
> For example, this modifies the Token API.  If, long-term, we think
> that Token should be extensible, then perhaps we should make it
> extensible now, and add this through a subclass of Token (perhaps a
> mixin interface that Tokens can implement).
>
Yes, I could introduce a new class called e.g. PayloadToken that extends
Token (good that it is not final anymore). I'm not sure I understand your
mixin interface idea... could you elaborate, please?

> I like the Payload feature, and think it should probably be added.  I
> just want to make sure that we've first thought a bit about its
> future-compatibility.
>
> Doug




Re: Payloads

Doug Cutting
Michael Busch wrote:
> Yes I could introduce a new class called e.g. PayloadToken that extends
> Token (good that it is not final anymore). Not sure if I understand your
> mixin interface idea... could you elaborate, please?

I'm not entirely sure I understand it either!

If Payload is an interface that tokens might implement, then some
posting implementations would treat tokens that implement Payload
specially.  And there might be other interfaces, say, PartOfSpeech, or
Emphasis, that tokens might implement, and that might also be handled by
some posting implementations.  A particular analyzer could emit tokens
that implement several of these interfaces, e.g., both PartOfSpeech and
Emphasis.  So these interfaces would be mixins.  But, of course, they'd
also have to each be implemented by the Token subclass, since Java
doesn't support multi-inheritance of implementation.
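The mixin idea described above could be sketched as follows. All names here are hypothetical, invented for illustration: aspect interfaces that a Token subclass implements, which a posting implementation detects via instanceof.

```java
// Hypothetical aspect interfaces that tokens might implement.
interface PartOfSpeech { String partOfSpeech(); }
interface Emphasis     { boolean isBold(); }

// A stand-in for Lucene's Token, reduced to its text for this sketch.
class Token {
    final String text;
    Token(String text) { this.text = text; }
}

// An analyzer could emit tokens implementing several aspects at once.
class RichToken extends Token implements PartOfSpeech, Emphasis {
    private final String pos;
    private final boolean bold;

    RichToken(String text, String pos, boolean bold) {
        super(text);
        this.pos = pos;
        this.bold = bold;
    }
    public String partOfSpeech() { return pos; }
    public boolean isBold()      { return bold; }
}

class PostingSketch {
    // A posting implementation treats tokens with known aspects specially.
    static String describe(Token t) {
        StringBuilder sb = new StringBuilder(t.text);
        if (t instanceof PartOfSpeech)
            sb.append("/").append(((PartOfSpeech) t).partOfSpeech());
        if (t instanceof Emphasis && ((Emphasis) t).isBold())
            sb.append("!");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe(new RichToken("lucene", "NN", true)));
        // prints lucene/NN!
    }
}
```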

I'm not sure this is the best approach: it's just the first one that
comes to my mind.  Perhaps instead Tokens should have a list of aspects,
each of which implement a TokenAspect interface, or somesuch.

It would be best to have an idea of how we'd like to be able to flexibly
add token features like text-emphasis and part-of-speech that are
handled specially by posting implementations before we add the Payload
feature.  So if the "mixin" approach is not a good idea, then we should
try to think of a better one.  If we can't think of a good approach,
then we can always punt, add Payloads now, and deal with the
consequences later.  But it's worth trying first.  Working through a few
examples in pseudo code is perhaps a worthwhile task.

Doug



Re: Payloads

Ning Li-3
> 1. Make the index format extensible by adding user-implementable reader
> and writer interfaces for postings.
> ...
> Here's a very rough, sketchy, first draft of a type (1) proposal.

Nice!

In approach 1, what is the best abstraction of a flexible index format
for Lucene?

The draft proposal seems to suggest the following (roughly):
  A dictionary entry is <Term, FilePointer>.
  A posting entry for a term in a document is <Doc, PostingContent>.
Classes which implement PostingFormat decide the format of PostingContent.

Storing all the posting content, e.g. frequencies and positions, in a
single file greatly simplifies things. However, this could cause some
performance penalty. For example, boolean query 'Apache AND Lucene'
would have to paw through positions. But position indexing for Apache
and Lucene is necessary to support phrase query '"Apache Lucene"'.

Is it a good idea to allow PostingFormat to decide whether and how to
store posting content in multiple files?
  A dictionary entry is <Term, <FilePointer>+>.
  A posting entry for a term in a document is <Doc, <PostingContent>+>.
Each PostingContent is stored in a separate file.

Or is a two-file abstraction good enough? It supports all formats in
approaches 2 and 3.
  A dictionary entry is <Term, FreqPointer, ProxPointer>.
  A posting entry for a term in a document is <Doc,
PerDocPostingContent, <Position, PerPositionPostingContent>+>.
Doc and PerDocPostingContent are stored in a .frq file.
Position and PerPositionPostingContent are stored in a .prx file.

What Michael called Payload can be viewed as PerPositionPostingContent here.
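The dictionary-entry shapes being compared could be sketched as simple classes (all names hypothetical, for illustration only):

```java
// Two-file abstraction: each term carries one pointer into the .frq file
// and one into the .prx file.
class TwoFileEntry {
    final String term;
    final long freqPointer;  // offset of <Doc, PerDocPostingContent> in .frq
    final long proxPointer;  // offset of <Position, PerPositionPostingContent> in .prx

    TwoFileEntry(String term, long freqPointer, long proxPointer) {
        this.term = term;
        this.freqPointer = freqPointer;
        this.proxPointer = proxPointer;
    }
}

// Multi-file generalization: one pointer per posting file, with the
// PostingFormat deciding how many files there are.
class MultiFileEntry {
    final String term;
    final long[] pointers;

    MultiFileEntry(String term, long[] pointers) {
        this.term = term;
        this.pointers = pointers;
    }
}
```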


> I'm not sure this is the best approach: it's just the first one that
> comes to my mind.  Perhaps instead Tokens should have a list of aspects,
> each of which implement a TokenAspect interface, or somesuch.

Making Token have a list of aspects would work. A particular analyzer
would add certain types of aspects to the tokens it emits. For
example, one analyzer adds a TextEmphasis aspect to a token. Another
analyzer adds a PartOfSpeech aspect to the same token. A particular
posting implementation would expect certain types of aspects. For
example, one may require a TextEmphasis aspect and a PartOfSpeech
aspect. The posting implementation generates posting content (payload)
by encoding the values of both aspects.


Ning



Re: Payloads

Nicolas Lalevée-2
In reply to this post by Michael Busch
On Wednesday, December 20, 2006, at 20:42, Michael Busch wrote:

> [...]
>
> I would love to work with you guys on the flexible index format and to
> combine my patch with your suggestions and the patch from Nicolas! I
> will look at your proposal and Nicolas' patch tomorrow (have to go now).
> I just attached my patch (LUCENE-755), so if you get a chance you could
> take a look at it.

I have just looked at it. It looks great :)
But I still don't understand why a new entry in the FieldInfo is needed.
The same applies to TermVector, and code like the following fails for no
obvious reason:

Document doc = new Document();
doc.add(new Field("f1", "v1", Store.YES, Index.TOKENIZED,
    TermVector.WITH_POSITIONS_OFFSETS));
doc.add(new Field("f1", "v2", Store.YES, Index.TOKENIZED, TermVector.NO));

RAMDirectory ram = new RAMDirectory();
IndexWriter writer = new IndexWriter(ram, new StandardAnalyzer(), true);
writer.addDocument(doc);
writer.close();

Knowing a little bit about how Lucene works, I have an idea why this
fails, but can we avoid it?

Nicolas

--
Nicolas LALEVÉE
Solutions & Technologies
ANYWARE TECHNOLOGIES
Tel : +33 (0)5 61 00 52 90
Fax : +33 (0)5 61 00 51 46
http://www.anyware-tech.com



Re: Payloads

Marvin Humphrey
In reply to this post by Ning Li-3

On Dec 21, 2006, at 1:58 PM, Ning Li wrote:

> Storing all the posting content, e.g. frequencies and positions, in a
> single file greatly simplifies things. However, this could cause some
> performance penalty. For example, boolean query 'Apache AND Lucene'
> would have to paw through positions. But position indexing for Apache
> and Lucene is necessary to support phrase query '"Apache Lucene"'.

Precision would be enhanced if boolean scoring took position into  
account, and could be further enhanced if each position were assigned  
a boost.  For that purpose, having everything in one file is an  
advantage, as it cuts down disk seeks.  Turn off freqs, positions,  
and boosts, and you have only doc_nums, which is ideal for matching  
rather than scoring, yielding a performance gain.

What's being considered doesn't really speak to the motivation of  
improving existing core functionality, though.  It's more about  
expanding the API to allow new applications.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





Re: Payloads

Ning Li-3
On 12/22/06, Marvin Humphrey <[hidden email]> wrote:
> Precision would be enhanced if boolean scoring took position into
> account, and could be further enhanced if each position were assigned
> a boost.  For that purpose, having everything in one file is an
> advantage, as it cuts down disk seeks.  Turn off freqs, positions,
> and boosts, and you have only doc_nums, which is ideal for matching
> rather than scoring, yielding a performance gain.

I'm aware of this design. Boolean and phrase queries are an example.
The point is, there are different queries whose processing will
(continue to) require different information of terms, especially when
flexible posting is allowed. The question is, should the number of
files used to store postings be customizable?

Cheers,
Ning



Re: Payloads

Marvin Humphrey

On Dec 22, 2006, at 9:17 AM, Ning Li wrote:

> The question is, should the number of
> files used to store postings be customizable?

I think it ought to remain an implementation detail for now.  Using  
multiple files is an optimization of unknown advantage.  
Optimizations have to work very hard to justify being put into public  
APIs because they constrain later refactoring and may in fact prevent  
better optimizations from being implemented later.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





Re: Payloads

Doug Cutting
In reply to this post by Ning Li-3
Ning Li wrote:
> The draft proposal seems to suggest the following (roughly):
>  A dictionary entry is <Term, FilePointer>.

Perhaps this ought to be <Term, TermInfo>, where TermInfo contains a
FilePointer and perhaps other information (e.g., frequency data).

>  A posting entry for a term in a document is <Doc, PostingContent>.
> Classes which implement PostingFormat decide the format of PostingContent.

Yes.
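
For concreteness, the PostingFormat idea might be sketched roughly like this (a hypothetical illustration; the interface and names are mine, not a proposed API):

```java
import java.io.DataOutput;
import java.io.IOException;

// Hypothetical sketch: a PostingFormat decides how per-document posting
// content is encoded. Interface and class names are illustrative only.
interface PostingFormat {
    void writePosting(DataOutput out, int docDelta, int[] positions)
            throws IOException;
}

// One possible format: doc delta, frequency, then delta-encoded positions.
class FreqPosFormat implements PostingFormat {
    public void writePosting(DataOutput out, int docDelta, int[] positions)
            throws IOException {
        out.writeInt(docDelta);
        out.writeInt(positions.length); // freq = number of positions
        int last = 0;
        for (int p : positions) {
            out.writeInt(p - last);     // delta-encode positions
            last = p;
        }
    }
}
```

A match-only format could override writePosting to emit just the doc delta, which is what makes the per-format flexibility attractive.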

> Is it a good idea to allow PostingFormat to decide whether and how to
> store posting content in multiple files?

Ideally, yes.  The easiest way to do this would be to have separate
files in each segment for each PostingFormat.  It would be better if
different posting formats could share files, but that's harder to
coordinate.

Alternately we could force all postings into a single file per segment.
That would simplify the APIs, but prohibit certain file formats, like
the one Lucene uses currently.

So the ideal solution would permit both different formats to either
share a file, or to use their own file(s).  Is it worth the complexity
this would add to the API?  Or should we jettison the notion of multiple
posting files per segment?

Doug



Re: Payloads

Doug Cutting
In reply to this post by Ning Li-3
Ning Li wrote:
> I'm aware of this design. Boolean and phrase queries are an example.
> The point is, there are different queries whose processing will
> (continue to) require different information of terms, especially when
> flexible posting is allowed. The question is, should the number of
> files used to store postings be customizable?

If one needs to search the same data with both unranked boolean
operators and with ranked proximity, one could use different fields.  If
that's an acceptable answer, then we might get away with a single
posting file per segment.  Back-compatibility will be a pain, but we
probably shouldn't let that drive the design.

Doug



Re: Payloads

Ning Li-3
In reply to this post by Doug Cutting
On 12/22/06, Doug Cutting <[hidden email]> wrote:
> Ning Li wrote:
> > The draft proposal seems to suggest the following (roughly):
> >  A dictionary entry is <Term, FilePointer>.
>
> Perhaps this ought to be <Term, TermInfo>, where TermInfo contains a
> FilePointer and perhaps other information (e.g., frequency data).

Yes. Another example is skip data.
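
Putting those pieces together, the <Term, TermInfo> entry could be sketched as follows (a minimal illustration; the field names are mine, not from any committed API):

```java
// A minimal sketch of the <Term, TermInfo> dictionary entry under discussion:
// a file pointer plus other per-term data such as frequency and skip info.
// All names are illustrative, not from a committed Lucene API.
class TermInfo {
    public final long postingsPointer; // file offset of this term's postings
    public final int docFreq;          // frequency data kept in the dictionary
    public final long skipPointer;     // skip data, as mentioned above

    public TermInfo(long postingsPointer, int docFreq, long skipPointer) {
        this.postingsPointer = postingsPointer;
        this.docFreq = docFreq;
        this.skipPointer = skipPointer;
    }
}
```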

> So the ideal solution would permit both different formats to either
> share a file, or to use their own file(s).

Agree.

> Is it worth the complexity
> this would add to the API?  Or should we jettison the notion of multiple
> posting files per segment?

+1 for a single posting file per segment. I was wondering if we wanted
to provide all the flexibility possible. Things will be much simpler
with a single posting file per segment... :-)

Ning



Re: Payloads

Marvin Humphrey
In reply to this post by Doug Cutting

On Dec 22, 2006, at 10:36 AM, Doug Cutting wrote:

> The easiest way to do this would be to have separate files in each  
> segment for each PostingFormat.  It would be better if different  
> posting formats could share files, but that's harder to coordinate.

The approach I'm taking in KinoSearch 0.20 is for each field to get  
its own postings file, named _XXX.pYYY, where "_XXX" is the segment  
name and "YYY" is the field number.  That allows a single decoder to  
be pointed at each file.  _XXX.frq and _XXX.prx have been eliminated.

One file per format would also work.

> Alternately we could force all postings into a single file per  
> segment.  That would simplify the APIs, but prohibit certain file  
> formats, like the one Lucene uses currently.

In theory, we could also have one file per property: doc num, freq,  
positions, boost, payload.  The base Posting object would have only  
document number, and each subclass would add a new property, and a  
new file.

I'm not sure that's better, as it precludes optimizations such as the  
even/odd trick currently used in _XXX.frq, but it merits mention as  
the conceptual opposite of having one file per format.

Matchers would be happy with that scheme no matter what.

> So the ideal solution would permit both different formats to either  
> share a file, or to use their own file(s).  Is it worth the  
> complexity this would add to the API?  Or should we jettison the  
> notion of multiple posting files per segment?

Does punting on this issue have any drawbacks other than an unknown  
performance impact?  Can we design the API so that we leave open the  
option of allowing the user to spec multiple files if that proves  
advantageous later?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





Re: Payloads

Michael Busch
In reply to this post by Nicolas Lalevée-2
Nicolas Lalevée wrote:
>
> I have just looked at it. It looks great :)
>  
Thanks! :-)

> But I still don't understand why a new entry in the FieldInfo is needed.
>  

The entry is not really *needed*, but I use it for
backwards-compatibility and as an optimization for fields that don't
have any tokens with payloads. For fields with payloads, PositionDelta
is shifted one bit, so for certain values the VInt needs an extra byte.
I have an index with about 500k web documents and measured that about
8% of all PositionDelta values would need one extra byte if
PositionDelta is shifted. For my index that means roughly 4% growth of
the total index size. By using a field bit, payloads can be disabled
for a field and the shifting of PositionDelta can be avoided.
Furthermore, if the payload field bit is not enabled, the index format
does not change at all.
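
To illustrate that cost: Lucene's VInt encoding packs 7 value bits per byte, so once the low bit of PositionDelta is reserved for the payload flag, any delta of 64 or more spills into a second byte. A standalone sketch (vIntLength is a hypothetical helper mirroring the VInt length rule, not Lucene's code):

```java
// Standalone sketch of why shifting PositionDelta left by one bit can cost
// an extra byte: VInt stores 7 bits per byte, so any delta >= 64 needs a
// second byte once its low bit is reserved for the payload flag.
class VIntDemo {
    // Number of bytes a VInt encoding would use for a value.
    public static int vIntLength(int v) {
        int len = 1;
        while ((v & ~0x7F) != 0) { // more than 7 significant bits left
            v >>>= 7;
            len++;
        }
        return len;
    }

    public static void main(String[] args) {
        int delta = 100;          // an example position delta
        int shifted = delta << 1; // low bit reserved for "has payload"
        System.out.println(vIntLength(delta));   // 1 byte
        System.out.println(vIntLength(shifted)); // 2 bytes
    }
}
```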

> There is the same issue for TermVector, and code like the following fails
> for no obvious reason:
>
> Document doc = new Document();
> doc.add(new Field("f1", "v1", Store.YES, Index.TOKENIZED,
> TermVector.WITH_POSITIONS_OFFSETS));
> doc.add(new Field("f1", "v2", Store.YES, Index.TOKENIZED, TermVector.NO));
>
> RAMDirectory ram = new RAMDirectory();
> IndexWriter writer = new IndexWriter(ram, new StandardAnalyzer(), true);
> writer.addDocument(doc);
> writer.close();
>
> Knowing a little bit about how Lucene works, I have an idea why this fails,
> but can we avoid it?
>
> Nicolas
>  
In the payload case there is no problem like this one. There is no new
Field option that can be used to set the field bit explicitly. The bit is
set automatically for a field as soon as the first Token of that field
that carries a payload is encountered.

Michael




Re: Payloads

Nadav Har'El
In reply to this post by Michael Busch
On Wed, Dec 20, 2006, Michael Busch wrote about "Payloads":
>..
> Some weeks ago I started working on an improved design which I would
> like to propose now. The new design simplifies the API extensions (the
> Field API remains unchanged) and uses less disk space in most use cases.
> Now there are only two classes that get new methods:
> - Token.setPayload()
>  Use this method to add arbitrary metadata to a Token in the form of a
> byte[] array.
>...

Hi Michael,

For some uses (e.g., faceted search), one wants to add a payload to each
document, not per position of some text field. In the faceted search example,
we could use payloads to encode the list of facets that each document
belongs to. For this, with the old API, you could have added a fixed term
to an untokenized field, and added a payload to that entire untokenized field.

With the new API, it seems doing this is much more difficult and requires
writing some sort of new Analyzer: one that does the regular analysis
that I want for the regular fields, and adds the payload to the one specific
field that lists the facets.
Am I understanding correctly? Or am I missing a better way to do this?
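
For illustration, the per-document facet list described above could be packed into the byte[] that Token.setPayload() accepts; this is a standalone sketch, and the fixed 4-byte-per-id encoding is purely illustrative:

```java
import java.nio.ByteBuffer;

// Standalone sketch: pack a document's facet ids into a byte[] suitable for
// use as a payload. The fixed-width int encoding is illustrative only.
class FacetPayload {
    static byte[] encode(int[] facetIds) {
        ByteBuffer buf = ByteBuffer.allocate(facetIds.length * 4);
        for (int id : facetIds) {
            buf.putInt(id);
        }
        return buf.array();
    }

    static int[] decode(byte[] payload) {
        ByteBuffer buf = ByteBuffer.wrap(payload);
        int[] ids = new int[payload.length / 4];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = buf.getInt();
        }
        return ids;
    }
}
```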

Thanks,
Nadav.

--
Nadav Har'El                        |    Wednesday, Jan  3 2007, 13 Tevet 5767
IBM Haifa Research Lab              |-----------------------------------------
                                    |If you lost your left arm, your right arm
http://nadav.harel.org.il           |would be left.

