[jira] Created: (LUCENE-755) Payloads


[jira] Created: (LUCENE-755) Payloads

JIRA jira@apache.org
Payloads
--------

                 Key: LUCENE-755
                 URL: http://issues.apache.org/jira/browse/LUCENE-755
             Project: Lucene - Java
          Issue Type: New Feature
          Components: Index
            Reporter: Michael Busch
         Assigned To: Michael Busch


This patch adds the ability to store arbitrary metadata (payloads) with each position of a term in its posting lists. This was discussed on the dev mailing list a while ago, where I proposed an initial design. The design in this patch is much improved, with modifications that make the feature easier to use and more efficient.

A payload is an array of bytes that can be stored inline in the ProxFile (.prx). Therefore this patch provides low-level APIs to simply store and retrieve byte arrays in the posting lists in an efficient way.

API and Usage
------------------------------  
The new class index.Payload is basically just a wrapper around a byte[] array together with int variables for offset and length. So a user does not have to create a byte array for every payload, but can rather allocate one array for all payloads of a document and provide offset and length information. This reduces object allocations on the application side.
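
The allocation pattern described above can be sketched roughly as follows. This is only an illustration: PayloadView is a hypothetical stand-in for the patch's index.Payload class (whose exact constructor is not shown in this message), and SharedBufferDemo.pack is an invented helper name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for index.Payload: a view into a shared byte[]. */
final class PayloadView {
    final byte[] data;  // shared backing array, not copied
    final int offset;   // start of this payload inside data
    final int length;   // number of payload bytes

    PayloadView(byte[] data, int offset, int length) {
        if (offset < 0 || length < 0 || offset + length > data.length) {
            throw new IllegalArgumentException("slice out of bounds");
        }
        this.data = data;
        this.offset = offset;
        this.length = length;
    }

    /** Copies the slice out; only meant for inspection. */
    byte[] toByteArray() {
        return Arrays.copyOfRange(data, offset, offset + length);
    }
}

final class SharedBufferDemo {
    /** Packs all payloads of one document into a single array and returns views into it. */
    static List<PayloadView> pack(List<byte[]> payloads) {
        int total = 0;
        for (byte[] p : payloads) {
            total += p.length;
        }
        byte[] shared = new byte[total]; // one allocation for the whole document
        List<PayloadView> views = new ArrayList<>();
        int offset = 0;
        for (byte[] p : payloads) {
            System.arraycopy(p, 0, shared, offset, p.length);
            views.add(new PayloadView(shared, offset, p.length));
            offset += p.length;
        }
        return views;
    }
}
```

Every view returned by pack shares the same backing array, so the application allocates one byte[] per document instead of one per token.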

In order to store payloads in the posting lists one has to provide a TokenStream or TokenFilter that produces Tokens with payloads. I added the following two methods to the Token class:
  /** Sets this Token's payload. */
  public void setPayload(Payload payload);
 
  /** Returns this Token's payload. */
  public Payload getPayload();

In order to retrieve the data from the index the interface TermPositions now offers two new methods:
  /** Returns the payload length of the current term position.
   *  This is invalid until {@link #nextPosition()} is called for
   *  the first time.
   *
   * @return length of the current payload in number of bytes
   */
  int getPayloadLength();
 
  /** Returns the payload data of the current term position.
   * This is invalid until {@link #nextPosition()} is called for
   * the first time.
   * This method must not be called more than once after each call
   * of {@link #nextPosition()}. However, payloads are loaded lazily,
   * so if the payload data for the current position is not needed,
   * this method need not be called at all, which saves the cost of loading it.
   *
   * @param data the array into which the data of this payload is to be
   *             stored, if it is big enough; otherwise, a new byte[] array
   *             is allocated for this purpose.
   * @param offset the offset in the array into which the data of this payload
   *               is to be stored.
   * @return a byte[] array containing the data of this payload
   * @throws IOException
   */
  byte[] getPayload(byte[] data, int offset) throws IOException;
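
The buffer-reuse contract in the Javadoc above can be sketched without an index. In this sketch the "stored" parameter stands in for the bytes that would be read from the ProxFile, and the class name PayloadBufferDemo is invented. That a newly allocated array holds the payload starting at index 0 is an assumption, since the Javadoc does not say so.

```java
import java.util.Arrays;

final class PayloadBufferDemo {
    /**
     * Mimics the documented contract of getPayload(byte[] data, int offset):
     * copy into the caller's array if it has room at the given offset,
     * otherwise allocate a new array.
     */
    static byte[] getPayload(byte[] stored, byte[] data, int offset) {
        if (data == null || data.length - offset < stored.length) {
            // Assumption: the fresh array holds the payload from index 0.
            return Arrays.copyOf(stored, stored.length);
        }
        System.arraycopy(stored, 0, data, offset, stored.length);
        return data;
    }
}
```

A caller that wants to avoid allocations passes a buffer that is large enough and then reads the payload at the offset it supplied. A caller that passes null, or a buffer that is too small, must use the returned array instead.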

Furthermore, this patch introduces the new method IndexOutput.writeBytes(byte[] b, int offset, int length). Until now there was only a writeBytes() method without an offset argument.

Implementation details
------------------------------
- One field bit in FieldInfos is used to indicate if payloads are enabled for a field. The user does not have to enable payloads for a field; this is done automatically:
   * The DocumentWriter enables payloads for a field if one or more Tokens carry payloads.
   * The SegmentMerger enables payloads for a field during a merge if payloads are enabled for that field in one or more segments.
- Backwards compatible: If payloads are not used, then the formats of the ProxFile and FreqFile don't change.
- Payloads are stored inline in the posting list of a term in the ProxFile. A payload of a term occurrence is stored right after its PositionDelta.
- Same-length compression: If payloads are enabled for a field, then the PositionDelta is shifted left by one bit. The lowest bit indicates whether the length of the following payload is stored explicitly. If it is not, i.e. the bit is 0, then the payload has the same length as the payload of the previous term occurrence.
- In order to support skipping on the ProxFile the length of the payload at every skip point has to be known. Therefore the payload length is also stored in the skip list located in the FreqFile. Here the same-length compression is also used: The lowest bit of DocSkip is used to indicate if the payload length is stored for a SkipDatum or if the length is the same as in the last SkipDatum.
- Payloads are loaded lazily. When a user calls TermPositions.nextPosition(), only the position and the payload length are loaded from the ProxFile. The payload itself is loaded only if the user calls getPayload(). If getPayload() is not called before nextPosition() is called again, the payload data is simply skipped.
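
The lazy-loading behaviour in the last item can be sketched with a toy reader. This is a simplification under stated assumptions, not the code from the patch: each entry is laid out as one position byte, one length byte and then the payload bytes, which is not the real .prx encoding, and LazyPayloadReader is an invented class name.

```java
import java.nio.ByteBuffer;

/** Simplified lazy reader; the entry layout is an illustration, not the .prx format. */
final class LazyPayloadReader {
    private final ByteBuffer in;
    private int payloadLength;         // length of the current payload
    private boolean needToLoadPayload; // true while the payload bytes are still unread

    LazyPayloadReader(byte[] bytes) {
        this.in = ByteBuffer.wrap(bytes);
    }

    /** Reads position and payload length only; skips an unread payload first. */
    int nextPosition() {
        if (needToLoadPayload) {
            in.position(in.position() + payloadLength); // skip without copying
        }
        int position = in.get();
        payloadLength = in.get();
        needToLoadPayload = true;
        return position;
    }

    int getPayloadLength() {
        return payloadLength;
    }

    /** Loads the payload bytes; allowed at most once per nextPosition(). */
    byte[] getPayload() {
        if (!needToLoadPayload) {
            throw new IllegalStateException("payload already loaded or no current position");
        }
        byte[] out = new byte[payloadLength];
        in.get(out);
        needToLoadPayload = false;
        return out;
    }
}
```

The flag records whether the payload bytes of the current position are still ahead in the stream, which is also why getPayload() can only be called once per position.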
 
Changes of file formats
------------------------------
- FieldInfos (.fnm)
The format of the .fnm file does not change. The only change is the use of the sixth lowest-order bit (0x20) of the FieldBits. If this bit is set, then payloads are enabled for the corresponding field.
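
As a small sketch of how such a flag bit is tested and combined: only the value 0x20 comes from the description above, while the class, constant and method names below are illustrative assumptions.

```java
final class FieldBitsDemo {
    // 0x20 is the bit named in the description; the constant name is illustrative.
    static final byte STORE_PAYLOADS = 0x20;

    /** True if the payload bit is set in a field's FieldBits byte. */
    static boolean payloadsEnabled(byte fieldBits) {
        return (fieldBits & STORE_PAYLOADS) != 0;
    }

    /** Returns the FieldBits with the payload bit set, leaving other bits untouched. */
    static byte enablePayloads(byte fieldBits) {
        return (byte) (fieldBits | STORE_PAYLOADS);
    }

    /** Merge rule from the description: enabled if enabled in at least one segment. */
    static boolean mergedPayloadsEnabled(byte... segmentFieldBits) {
        for (byte bits : segmentFieldBits) {
            if (payloadsEnabled(bits)) {
                return true;
            }
        }
        return false;
    }
}
```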

- ProxFile (.prx)
ProxFile (.prx) -->  <TermPositions>^TermCount
TermPositions   --> <Positions>^DocFreq
Positions       --> <PositionDelta, Payload?>^Freq
Payload         --> <PayloadLength?, PayloadData>
PositionDelta   --> VInt
PayloadLength   --> VInt
PayloadData     --> byte^PayloadLength

For payloads disabled (unchanged):
PositionDelta is the difference between the position of the current occurrence in the document and the previous occurrence (or zero, if this is the first occurrence in this document).
 
For payloads enabled:
PositionDelta/2 is the difference between the position of the current occurrence in the document and the previous occurrence. If PositionDelta is odd, then PayloadLength is stored. If PositionDelta is even, then the length of the current payload equals the length of the previous payload and thus PayloadLength is omitted.
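
The payload-enabled encoding can be sketched for the positions of one term in one document as follows. This is a minimal illustration, not the code from the patch: the class and method names are invented, and starting with "no previous length" (-1), so that the first PayloadLength is always written, is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.List;

final class ProxPayloadCodec {
    /** Writes a non-negative int as a VInt: 7 data bits per byte, high bit means more bytes follow. */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    static int readVInt(ByteArrayInputStream in) {
        int b = in.read();
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.read();
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    /** Encodes the positions of one term in one document, each with its payload. */
    static byte[] encode(int[] positions, byte[][] payloads) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int lastPosition = 0;
        int lastLength = -1; // assumption: forces the first length to be written
        for (int i = 0; i < positions.length; i++) {
            int delta = positions[i] - lastPosition;
            int length = payloads[i].length;
            if (length != lastLength) {
                writeVInt(out, (delta << 1) | 1); // odd: PayloadLength follows
                writeVInt(out, length);
                lastLength = length;
            } else {
                writeVInt(out, delta << 1);       // even: same length as previous payload
            }
            out.write(payloads[i], 0, length);
            lastPosition = positions[i];
        }
        return out.toByteArray();
    }

    /** Decodes freq occurrences; returns the positions and appends the payloads to payloadsOut. */
    static int[] decode(byte[] bytes, int freq, List<byte[]> payloadsOut) {
        ByteArrayInputStream in = new ByteArrayInputStream(bytes);
        int[] positions = new int[freq];
        int position = 0;
        int length = 0;
        for (int i = 0; i < freq; i++) {
            int code = readVInt(in);
            position += code >>> 1;          // PositionDelta/2
            if ((code & 1) != 0) {
                length = readVInt(in);       // explicit PayloadLength
            }
            byte[] payload = new byte[length];
            in.read(payload, 0, length);
            payloadsOut.add(payload);
            positions[i] = position;
        }
        return positions;
    }
}
```

For positions 3, 7 and 20 with payloads of lengths 2, 2 and 1, the second occurrence is written with an even PositionDelta and no length, so the length is only stored when it changes.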

- FreqFile (.frq)

SkipDatum     --> DocSkip, PayloadLength?, FreqSkip, ProxSkip
PayloadLength --> VInt

For payloads disabled (unchanged):
DocSkip records the document number before every SkipInterval-th document in TermFreqs. Document numbers are represented as differences from the previous value in the sequence.

For payloads enabled:
DocSkip/2 records the document number before every SkipInterval-th document in TermFreqs. If DocSkip is odd, then PayloadLength follows. If DocSkip is even, then the length of the payload at the current skip point equals the length of the payload at the last skip point and thus PayloadLength is omitted.
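
The same parity trick on the skip list can be sketched like this. The sketch emits a flat list of ints, each of which would be one VInt on disk. The class name, the representation of a skip entry as {docDelta, payloadLength, freqSkip, proxSkip} and the initial "no previous length" value of -1 are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class SkipDatumDemo {
    /**
     * Encodes skip entries as a flat list of ints.
     * Every entry is {docDelta, payloadLength, freqSkip, proxSkip}.
     */
    static List<Integer> encode(int[][] entries) {
        List<Integer> out = new ArrayList<>();
        int lastPayloadLength = -1; // assumption: forces the first length to be written
        for (int[] e : entries) {
            int docDelta = e[0];
            int payloadLength = e[1];
            if (payloadLength != lastPayloadLength) {
                out.add((docDelta << 1) | 1); // odd DocSkip: PayloadLength follows
                out.add(payloadLength);
                lastPayloadLength = payloadLength;
            } else {
                out.add(docDelta << 1);       // even DocSkip: length unchanged
            }
            out.add(e[2]); // FreqSkip
            out.add(e[3]); // ProxSkip
        }
        return out;
    }
}
```

With three skip points whose payload lengths are 4, 4 and 2, only the first and third SkipDatum carry a PayloadLength.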


This encoding is space efficient for different use cases:
   * If only some fields of an index have payloads, then there's no space overhead for the fields with payloads disabled.
   * If the payloads of consecutive term positions have the same length, then the length only has to be stored once for every term. This should be a common case, because users probably use the same format for all payloads.
   * If only a few terms of a field have payloads, then we don't waste much space because we benefit again from the same-length-compression since we only have to store the length zero for the empty payloads once per term.

All unit tests pass.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


[jira] Updated: (LUCENE-755) Payloads

JIRA jira@apache.org
     [ http://issues.apache.org/jira/browse/LUCENE-755?page=all ]

Michael Busch updated LUCENE-755:
---------------------------------

    Attachment: payloads.patch




[jira] Commented: (LUCENE-755) Payloads

JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/LUCENE-755?page=comments#action_12460496 ]
           
Grant Ingersoll commented on LUCENE-755:
----------------------------------------

Great patch, Michael, and something that will come in handy for a lot of people.  I can vouch it applies cleanly and all the tests pass.  

Now I am not sure I am totally understanding everything just yet so the following is thinking aloud, but bear with me.

One of the big unanswered questions (besides how this fits into the whole flexible indexing scheme as discussed on the Payloads and Flexible indexing threads on java-dev) at this point for me is: how do we expose/integrate this into the scoring side of the equation?  It seems we would need some interfaces that hook into the scoring mechanism so that people can define what all these payloads are actually used for, or am I missing something?  Yet the TermScorer takes in the TermDocs, so it doesn't yet have access to the payloads (although this is easily remedied since we have access to the TermPositions when we construct TermScorer.)  Span Queries could easily be extended to include payload information since they use the TermPositions, which would be useful for post-processing algorithms.

I can imagine an interface that would have to be set on the Query/Scorer (and inherited unless otherwise set???).  The default implementation would be to ignore any payload, I suppose.  We could also add a callback in the Similarity mechanism, something like:

float calculatePayloadFactor(byte[] payload);
or
float calculatePayloadFactor(Term term, byte[] payload);

Then this factor could be added/multiplied into the term score or whatever other scorers use it??????

Is this making any sense?
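
To make the callback idea above concrete, here is a rough sketch. Everything in it is hypothetical: the interface only mirrors the signature proposed in this comment and is not an existing Lucene API, and the one-byte boost encoding is an assumption chosen for the example.

```java
/** Hypothetical callback mirroring the proposed signature; not an existing Lucene API. */
interface PayloadFactor {
    float calculatePayloadFactor(byte[] payload);
}

final class PayloadScoringDemo {
    /** Default behaviour suggested above: ignore the payload entirely. */
    static final PayloadFactor IGNORE = payload -> 1.0f;

    /** Example encoding (an assumption): one unsigned byte 0..255 mapped to a boost of 0.0..2.0. */
    static final PayloadFactor BYTE_BOOST = payload ->
            (payload == null || payload.length == 0) ? 1.0f : (payload[0] & 0xFF) / 127.5f;

    /** Multiplies a term score by the payload factor of one occurrence. */
    static float score(float termScore, byte[] payload, PayloadFactor factor) {
        return termScore * factor.calculatePayloadFactor(payload);
    }
}
```

Whether the factor should be multiplied or added, and per occurrence or per document, is exactly the open question raised in this comment. The sketch picks multiplication per occurrence only to have something concrete.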





[jira] Commented: (LUCENE-755) Payloads

JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/LUCENE-755?page=comments#action_12460647 ]
           
Michael Busch commented on LUCENE-755:
--------------------------------------

> Great patch, Michael, and something that will come in handy for a lot of people. I can vouch it applies cleanly and all the tests pass.

Cool, thanks for trying it out, Grant! :-)

> Now I am not sure I am totally understanding everything just yet so the following is thinking aloud, but bear with me.

> One of the big unanswered questions (besides how this fits into the whole flexible indexing scheme as discussed on the Payloads and
> Flexible indexing threads on java-dev) at this point for me is: how do we expose/integrate this into the scoring side of the equation? It seems
> we would need some interfaces that hook into the scoring mechanism so that people can define what all these payloads are actually used
> for, or am I missing something? Yet the TermScorer takes in the TermDocs, so it doesn't yet have access to the payloads (although this is
> easily remedied since we have access to the TermPositions when we construct TermScorer.) Span Queries could easily be extended to
> include payload information since they use the TermPositions, which would be useful for post-processing algorithms.

I would say it really depends on the use case of the payloads. For example, XML search: here payloads can be used to store depth information of terms. An extended Span class could then take the depth information into account for query evaluation. As you pointed out, the span classes already have easy access to the payloads since they use TermPositions, so implementing such a subclass should be fairly simple.

> I can imagine an interface that you would have to be set on the Query/Scorer (and inherited unless otherwise set???). The default
> implementation would be to ignore any payload, I suppose. We could also add a callback in the Similarity mechanism, something like:
>
> float calculatePayloadFactor(byte[] payload);
> or
> float calculatePayloadFactor(Term term, byte[] payload);
>
> Then this factor could be added/multiplied into the term score or whatever other scorers use it??????
>
> Is this making any sense?

I believe the case you're describing here is per-term norms/boosts? Yah I think this would work and you are right, the Scorers have to have access to TermPositions; TermDocs is not sufficient. So yes, it would be nice if TermScorer used TermPositions instead of TermDocs. I just opened LUCENE-761, which changes SegmentTermPositions to clone the proxStream lazily the first time nextPosition() is called. Then the costs for creating TermDocs and TermPositions are the same, and together with lazy prox skipping (LUCENE-687) there's no longer any reason not to use TermPositions.

However, as currently discussed on java-dev, per-term boosts could also be part of a new posting format in the flexible index scheme and thus not stored in the payloads.

So in general this patch doesn't yet add a new search feature to Lucene; rather, it opens the door for new features in the future. The way to add such a new feature is then:
1) Write an analyzer that provides the data necessary for the new feature and produces Tokens with payloads containing this data.
2) Write/extend a Scorer that has access to TermPositions and makes use of the data in the payloads for matching or scoring or both.
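
The two steps can be sketched end to end for the XML depth use case mentioned earlier in this comment. This is a toy model under stated assumptions: Tok is a hypothetical stand-in for a Token with a payload, the "analyzer" works on pre-split strings, and the "scorer" is just a filter over an in-memory list rather than a real Scorer over TermPositions.

```java
import java.util.ArrayList;
import java.util.List;

final class DepthPayloadDemo {
    /** Hypothetical stand-in for a Token carrying a one-byte payload. */
    static final class Tok {
        final String term;
        final int position;
        final byte[] payload;

        Tok(String term, int position, byte[] payload) {
            this.term = term;
            this.position = position;
            this.payload = payload;
        }
    }

    /** Step 1: an "analyzer" that tags each word with its XML nesting depth. */
    static List<Tok> analyze(String[] input) {
        List<Tok> out = new ArrayList<>();
        int depth = 0;
        int position = 0;
        for (String s : input) {
            if (s.startsWith("</")) {
                depth--;                 // closing tag
            } else if (s.startsWith("<")) {
                depth++;                 // opening tag
            } else {
                out.add(new Tok(s, position++, new byte[]{(byte) depth}));
            }
        }
        return out;
    }

    /** Step 2: a "scorer" that only matches occurrences of term at the requested depth. */
    static List<Integer> matchAtDepth(List<Tok> postings, String term, int depth) {
        List<Integer> positions = new ArrayList<>();
        for (Tok t : postings) {
            if (t.term.equals(term) && t.payload[0] == depth) {
                positions.add(t.position);
            }
        }
        return positions;
    }
}
```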


> Payloads
> --------
>
>                 Key: LUCENE-755
>                 URL: http://issues.apache.org/jira/browse/LUCENE-755
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Michael Busch
>         Assigned To: Michael Busch
>         Attachments: payloads.patch
>
>
> This patch adds the possibility to store arbitrary metadata (payloads) together with each position of a term in its posting lists. A while ago this was discussed on the dev mailing list, where I proposed an initial design. This patch has a much improved design with modifications, that make this new feature easier to use and more efficient.
> A payload is an array of bytes that can be stored inline in the ProxFile (.prx). Therefore this patch provides low-level APIs to simply store and retrieve byte arrays in the posting lists in an efficient way.
> API and Usage
> ------------------------------  
> The new class index.Payload is basically just a wrapper around a byte[] array together with int variables for offset and length. So a user does not have to create a byte array for every payload, but can rather allocate one array for all payloads of a document and provide offset and length information. This reduces object allocations on the application side.
> In order to store payloads in the posting lists one has to provide a TokenStream or TokenFilter that produces Tokens with payloads. I added the following two methods to the Token class:
>   /** Sets this Token's payload. */
>   public void setPayload(Payload payload);
>  
>   /** Returns this Token's payload. */
>   public Payload getPayload();
> In order to retrieve the data from the index the interface TermPositions now offers two new methods:
>   /** Returns the payload length of the current term position.
>    *  This is invalid until {@link #nextPosition()} is called for
>    *  the first time.
>    *
>    * @return length of the current payload in number of bytes
>    */
>   int getPayloadLength();
>  
>   /** Returns the payload data of the current term position.
>    * This is invalid until {@link #nextPosition()} is called for
>    * the first time.
>    * This method must not be called more than once after each call
>    * of {@link #nextPosition()}. However, payloads are loaded lazily,
>    * so if the payload data for the current position is not needed,
>    * this method may not be called at all for performance reasons.
>    *
>    * @param data the array into which the data of this payload is to be
>    *             stored, if it is big enough; otherwise, a new byte[] array
>    *             is allocated for this purpose.
>    * @param offset the offset in the array into which the data of this payload
>    *               is to be stored.
>    * @return a byte[] array containing the data of this payload
>    * @throws IOException
>    */
>   byte[] getPayload(byte[] data, int offset) throws IOException;
> Furthermore, this patch introduces the new method IndexOutput.writeBytes(byte[] b, int offset, int length). So far there was only a writeBytes() method without an offset argument.
> Implementation details
> ------------------------------
> - One field bit in FieldInfos is used to indicate if payloads are enabled for a field. The user does not have to enable payloads for a field; this is done automatically:
>    * The DocumentWriter enables payloads for a field if one or more Tokens carry payloads.
>    * The SegmentMerger enables payloads for a field during a merge if payloads are enabled for that field in one or more segments.
> - Backwards compatible: If payloads are not used, then the formats of the ProxFile and FreqFile don't change.
> - Payloads are stored inline in the posting list of a term in the ProxFile. A payload of a term occurrence is stored right after its PositionDelta.
> - Same-length compression: If payloads are enabled for a field, then the PositionDelta is shifted left by one bit. The lowest bit is used to indicate whether the length of the following payload is stored explicitly. If not, i.e. the bit is not set, then the payload has the same length as the payload of the previous term occurrence.
> - In order to support skipping on the ProxFile the length of the payload at every skip point has to be known. Therefore the payload length is also stored in the skip list located in the FreqFile. Here the same-length compression is also used: The lowest bit of DocSkip is used to indicate if the payload length is stored for a SkipDatum or if the length is the same as in the last SkipDatum.
> - Payloads are loaded lazily. When a user calls TermPositions.nextPosition(), only the position and the payload length are loaded from the ProxFile. If the user calls getPayload(), the payload is actually loaded. If getPayload() is not called before nextPosition() is called again, the payload data is just skipped.
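The lazy-loading behaviour and the buffer-reuse contract of getPayload(byte[] data, int offset) can be sketched roughly as follows. This is an in-memory model under simplifying assumptions, not code from the patch: the class LazyPayloadReader is hypothetical, positions and payload lengths are stored as single unsigned bytes instead of VInt deltas, and throwing IllegalStateException on a second getPayload() call is just one possible way to enforce the "not more than once" rule.

```java
import java.nio.ByteBuffer;

/** Hypothetical in-memory model of lazy payload loading; not code from the patch. */
final class LazyPayloadReader {
    private final ByteBuffer in;
    private int payloadLength;
    private boolean payloadPending; // true while the current payload bytes are unread

    /** Assumed layout per occurrence: position byte, length byte, payload bytes. */
    LazyPayloadReader(byte[] encoded) {
        this.in = ByteBuffer.wrap(encoded);
    }

    /** Loads only the position and the payload length of the next occurrence. */
    int nextPosition() {
        if (payloadPending) {
            // The previous payload was never requested: skip it without copying.
            in.position(in.position() + payloadLength);
        }
        int position = in.get() & 0xFF;
        payloadLength = in.get() & 0xFF;
        payloadPending = true;
        return position;
    }

    int getPayloadLength() {
        return payloadLength;
    }

    /** Copies the payload into data at offset if it fits; otherwise allocates a new array. */
    byte[] getPayload(byte[] data, int offset) {
        if (!payloadPending) {
            throw new IllegalStateException("no unread payload for the current position");
        }
        byte[] target = data;
        int targetOffset = offset;
        if (data == null || data.length - offset < payloadLength) {
            target = new byte[payloadLength];
            targetOffset = 0;
        }
        in.get(target, targetOffset, payloadLength);
        payloadPending = false;
        return target;
    }
}
```

A caller that only needs positions never pays for copying payload bytes; a caller that does need them can reuse one buffer across positions.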
>  
> Changes of file formats
> ------------------------------
> - FieldInfos (.fnm)
> The format of the .fnm file does not change. The only change is the use of the sixth lowest-order bit (0x20) of the FieldBits. If this bit is set, then payloads are enabled for the corresponding field.
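As an illustration only, the bit test could look like the following. The class and method names are hypothetical; only the 0x20 mask and the merge rule (payloads are enabled if any merged segment has them enabled) come from the description above.

```java
/** Hypothetical helper around the FieldBits byte; names are illustrative. */
final class FieldBitsSketch {
    /** Sixth lowest-order bit of FieldBits, as stated in the description. */
    static final int PAYLOADS_BIT = 0x20;

    static boolean payloadsEnabled(byte fieldBits) {
        return (fieldBits & PAYLOADS_BIT) != 0;
    }

    static byte enablePayloads(byte fieldBits) {
        return (byte) (fieldBits | PAYLOADS_BIT);
    }

    /** Merge rule: enabled in the merged segment if enabled in at least one input. */
    static boolean mergedPayloadsEnabled(byte... segmentFieldBits) {
        for (byte bits : segmentFieldBits) {
            if (payloadsEnabled(bits)) {
                return true;
            }
        }
        return false;
    }
}
```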
> - ProxFile (.prx)
> ProxFile (.prx) -->  <TermPositions>^TermCount
> TermPositions   --> <Positions>^DocFreq
> Positions       --> <PositionDelta, Payload?>^Freq
> Payload         --> <PayloadLength?, PayloadData>
> PositionDelta   --> VInt
> PayloadLength   --> VInt
> PayloadData     --> byte^PayloadLength
> For payloads disabled (unchanged):
> PositionDelta is the difference between the position of the current occurrence in the document and the previous occurrence (or zero, if this is the first occurrence in this document).
>  
> For payloads enabled:
> PositionDelta/2 is the difference between the position of the current occurrence in the document and the previous occurrence. If PositionDelta is odd, then PayloadLength is stored. If PositionDelta is even, then the length of the current payload equals the length of the previous payload and thus PayloadLength is omitted.
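The odd/even rule can be made concrete with a small encoder and decoder. This is a sketch under stated assumptions, not the patch's code: ProxPayloadCodec and its methods are hypothetical names, it handles the positions of a single document only, and it forces the first payload length to be written by starting with a "previous length" of -1.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

/** Hypothetical codec for <PositionDelta, PayloadLength?, PayloadData> entries. */
final class ProxPayloadCodec {
    /** Writes a non-negative int as a VInt: 7 data bits per byte, high bit = more bytes follow. */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    /** Reads a VInt starting at cursor[0] and advances the cursor. */
    static int readVInt(byte[] in, int[] cursor) {
        int b = in[cursor[0]++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in[cursor[0]++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    static byte[] encode(int[] positions, byte[][] payloads) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int lastPosition = 0;
        int lastLength = -1; // assumption: no previous payload, so the first length is written
        for (int i = 0; i < positions.length; i++) {
            int delta = positions[i] - lastPosition;
            int length = payloads[i].length;
            if (length == lastLength) {
                writeVInt(out, delta << 1);        // even: length omitted
            } else {
                writeVInt(out, (delta << 1) | 1);  // odd: length follows
                writeVInt(out, length);
                lastLength = length;
            }
            out.write(payloads[i], 0, length);
            lastPosition = positions[i];
        }
        return out.toByteArray();
    }

    /** Decodes freq entries; fills positionsOut and returns the payloads. */
    static byte[][] decode(byte[] in, int freq, int[] positionsOut) {
        int[] cursor = {0};
        byte[][] payloads = new byte[freq][];
        int position = 0;
        int length = 0;
        for (int i = 0; i < freq; i++) {
            int code = readVInt(in, cursor);
            position += code >>> 1;
            if ((code & 1) != 0) {
                length = readVInt(in, cursor);
            }
            payloads[i] = Arrays.copyOfRange(in, cursor[0], cursor[0] + length);
            cursor[0] += length;
            positionsOut[i] = position;
        }
        return payloads;
    }
}
```

For positions 2, 5, 9 with payloads {7}, {8}, {1, 2}, this sketch writes the bytes 5 1 7, 6 8, 9 2 1 2: the second entry is even (6 = 3*2) because its payload has the same length as the first, so no length byte is stored for it.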
> - FreqFile (.frq)
> SkipDatum     --> DocSkip, PayloadLength?, FreqSkip, ProxSkip
> PayloadLength --> VInt
> For payloads disabled (unchanged):
> DocSkip records the document number before every SkipInterval-th document in TermFreqs. Document numbers are represented as differences from the previous value in the sequence.
> For payloads enabled:
> DocSkip/2 records the document number before every SkipInterval-th document in TermFreqs. If DocSkip is odd, then PayloadLength follows. If DocSkip is even, then the length of the payload at the current skip point equals the length of the payload at the last skip point and thus PayloadLength is omitted.
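The same flag-bit trick applied to a skip entry might look like this. It is only a sketch: SkipDatumSketch is a hypothetical name, and the method returns the values in the order in which they would be written as VInts rather than writing actual bytes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of SkipDatum --> DocSkip, PayloadLength?, FreqSkip, ProxSkip. */
final class SkipDatumSketch {
    /**
     * Returns the values of one skip entry in write order.
     * docDelta is the document-number difference to the previous skip point;
     * lastPayloadLength is the payload length recorded at the previous skip point.
     */
    static List<Integer> encode(int docDelta, int payloadLength, int lastPayloadLength,
                                int freqSkip, int proxSkip) {
        List<Integer> values = new ArrayList<>();
        if (payloadLength == lastPayloadLength) {
            values.add(docDelta << 1);        // even DocSkip: PayloadLength omitted
        } else {
            values.add((docDelta << 1) | 1);  // odd DocSkip: PayloadLength follows
            values.add(payloadLength);
        }
        values.add(freqSkip);
        values.add(proxSkip);
        return values;
    }
}
```

With a doc delta of 16 and a payload length of 4, the first skip entry starts with 33 followed by 4; a later entry with the same payload length starts with 32 and carries no length.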
> This encoding is space efficient for different use cases:
>    * If only some fields of an index have payloads, then there's no space overhead for the fields with payloads disabled.
>    * If the payloads of consecutive term positions have the same length, then the length only has to be stored once for every term. This should be a common case, because users probably use the same format for all payloads.
>    * If only a few terms of a field have payloads, then we don't waste much space because we benefit again from the same-length-compression since we only have to store the length zero for the empty payloads once per term.
> All unit tests pass.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


[jira] Updated: (LUCENE-755) Payloads

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicolas Lalevée updated LUCENE-755:
-----------------------------------

    Attachment: payload.patch

> Payloads
> --------
>
>                 Key: LUCENE-755
>                 URL: https://issues.apache.org/jira/browse/LUCENE-755
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Michael Busch
>         Assigned To: Michael Busch
>         Attachments: payload.patch, payloads.patch
>
>



[jira] Commented: (LUCENE-755) Payloads

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463414 ]

Nicolas Lalevée commented on LUCENE-755:
----------------------------------------

The patch I have just uploaded (payload.patch) is Michael's patch (payloads.patch) plus a customization of how payloads are written and read, exactly as I did for LUCENE-662. An IndexFormat is in fact a factory of PayloadWriter and PayloadReader; this index format is stored in the Directory instance.

Note that I have changed neither the javadoc nor the comments included in Michael's patch, so it needs some cleanup if somebody is interested in committing it.
And sorry for the name of the patch I have uploaded; it is a little bit confusing now, and I can't change its name. I will be more careful next time when naming my patch files.




Commented: (LUCENE-755) Payloads

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12479781 ]

Grant Ingersoll commented on LUCENE-755:
----------------------------------------

Nicolas,

Are you implying that your patch fits in with LUCENE-662 (and needs to be applied after it), or is it just in the style of LUCENE-662 without depending on it?

Thanks,
Grant




Commented: (LUCENE-755) Payloads

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12479841 ]

Nicolas Lalevée commented on LUCENE-755:
----------------------------------------

Grant>
The patch I have proposed here has no dependency on LUCENE-662; I just "imported" some ideas from it and put them into this patch. Since LUCENE-662 has evolved, the two patches will probably conflict. The best one to use here is Michael's; I think it won't conflict with LUCENE-662. And if both are intended to be committed, then the best approach is to commit them separately and redo the work I did in the provided patch (I remember that it was quite easy).


> Furthermore, this patch introduces the new method IndexOutput.writeBytes(byte[] b, int offset, int length). So far there was only a writeBytes() method without an offset argument.
> Implementation details
> ------------------------------
> - One field bit in FieldInfos is used to indicate if payloads are enabled for a field. The user does not have to enable payloads for a field, this is done automatically:
>    * The DocumentWriter enables payloads for a field if one or more Tokens carry payloads.
>    * The SegmentMerger enables payloads for a field during a merge, if payloads are enabled for that field in one or more segments.
> - Backwards compatible: If payloads are not used, then the formats of the ProxFile and FreqFile don't change
> - Payloads are stored inline in the posting list of a term in the ProxFile. A payload of a term occurrence is stored right after its PositionDelta.
> - Same-length compression: If payloads are enabled for a field, then the PositionDelta is shifted one bit. The lowest bit is used to indicate whether the length of the following payload is stored explicitly. If not, i. e. the bit is false, then the payload has the same length as the payload of the previous term occurrence.
> - In order to support skipping on the ProxFile the length of the payload at every skip point has to be known. Therefore the payload length is also stored in the skip list located in the FreqFile. Here the same-length compression is also used: The lowest bit of DocSkip is used to indicate if the payload length is stored for a SkipDatum or if the length is the same as in the last SkipDatum.
> - Payloads are loaded lazily. When a user calls TermPositions.nextPosition() then only the position and the payload length is loaded from the ProxFile. If the user calls getPayload() then the payload is actually loaded. If getPayload() is not called before nextPosition() is called again, then the payload data is just skipped.
>  
> Changes of file formats
> ------------------------------
> - FieldInfos (.fnm)
> The format of the .fnm file does not change. The only change is the use of the sixth lowest-order bit (0x20) of the FieldBits. If this bit is set, then payloads are enabled for the corresponding field.
> - ProxFile (.prx)
> ProxFile (.prx) -->  <TermPositions>^TermCount
> TermPositions   --> <Positions>^DocFreq
> Positions       --> <PositionDelta, Payload?>^Freq
> Payload         --> <PayloadLength?, PayloadData>
> PositionDelta   --> VInt
> PayloadLength   --> VInt
> PayloadData     --> byte^PayloadLength
> For payloads disabled (unchanged):
> PositionDelta is the difference between the position of the current occurrence in the document and the previous occurrence (or zero, if this is the first occurrence in this document).
>  
> For Payloads enabled:
> PositionDelta/2 is the difference between the position of the current occurrence in the document and the previous occurrence. If PositionDelta is odd, then PayloadLength is stored. If PositionDelta is even, then the length of the current payload equals the length of the previous payload and thus PayloadLength is omitted.
> - FreqFile (.frq)
> SkipDatum     --> DocSkip, PayloadLength?, FreqSkip, ProxSkip
> PayloadLength --> VInt
> For payloads disabled (unchanged):
> DocSkip records the document number before every SkipInterval th document in TermFreqs. Document numbers are represented as differences from the previous value in the sequence.
> For payloads enabled:
> DocSkip/2 records the document number before every SkipInterval th  document in TermFreqs. If DocSkip is odd, then PayloadLength follows. If DocSkip is even, then the length of the payload at the current skip point equals the length of the payload at the last skip point and thus PayloadLength is omitted.
> This encoding is space efficient for different use cases:
>    * If only some fields of an index have payloads, then there's no space overhead for the fields with payloads disabled.
>    * If the payloads of consecutive term positions have the same length, then the length only has to be stored once for every term. This should be a common case, because users probably use the same format for all payloads.
>    * If only a few terms of a field have payloads, then we don't waste much space because we benefit again from the same-length-compression since we only have to store the length zero for the empty payloads once per term.
> All unit tests pass.
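
The same-length compression described in the quoted issue text can be illustrated with a minimal Java sketch. It is not taken from payloads.patch: the class and method names are hypothetical, it handles only the positions of one term in one document, and it assumes the "previous payload length" starts at -1 so that the first length is always written.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; names are hypothetical and not part of Lucene. */
final class PayloadPositionSketch {

    /** Writes a non-negative int as a VInt: 7 data bits per byte, high bit set if more bytes follow. */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    /** Reads a VInt starting at cursor[0] and advances the cursor. */
    static int readVInt(byte[] data, int[] cursor) {
        byte b = data[cursor[0]++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = data[cursor[0]++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    /** Encodes <PositionDelta, PayloadLength?, PayloadData> for each occurrence. */
    static byte[] encode(int[] positions, byte[][] payloads) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int lastPosition = 0;
        int lastLength = -1; // assumption: no previous payload, so the first length is written
        for (int i = 0; i < positions.length; i++) {
            int delta = positions[i] - lastPosition;
            int length = payloads[i].length;
            if (length != lastLength) {
                writeVInt(out, (delta << 1) | 1); // odd code: PayloadLength follows
                writeVInt(out, length);
                lastLength = length;
            } else {
                writeVInt(out, delta << 1);       // even code: same length as previous payload
            }
            out.write(payloads[i], 0, length);
            lastPosition = positions[i];
        }
        return out.toByteArray();
    }

    /** Decodes {position, payloadLength} pairs, skipping the payload bytes themselves. */
    static List<int[]> decode(byte[] data, int freq) {
        List<int[]> result = new ArrayList<>();
        int[] cursor = {0};
        int position = 0;
        int length = 0;
        for (int i = 0; i < freq; i++) {
            int code = readVInt(data, cursor);
            position += code >>> 1;
            if ((code & 1) != 0) {
                length = readVInt(data, cursor);
            }
            cursor[0] += length; // payload data is skipped, as when getPayload() is never called
            result.add(new int[] {position, length});
        }
        return result;
    }
}
```

For positions 3, 7 and 12 with payload lengths 2, 2 and 1, this writes the codes 7, 8 and 11: the second occurrence reuses the previous length, so no PayloadLength is stored for it. According to the description, the DocSkip entries in the skip list use the same low-bit trick.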


Re: Commented: (LUCENE-755) Payloads

Grant Ingersoll-4
I think it makes the most sense to get flexible indexing in first,  
and then make payloads work with it.  On the other hand, payloads  
looked pretty straightforward to me, whereas FI is much more involved  
(or at least it feels that way).

As it is right now, I would like to at least review the two patches  
and start thinking about them in greater depth.  The payloads patch  
needs a little more work in that I want to integrate it with the  
Similarity class so people can customize their scoring.

-Grant

On Mar 10, 2007, at 9:30 AM, Nicolas Lalevée (JIRA) wrote:

>
>     [ https://issues.apache.org/jira/browse/LUCENE-755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12479841 ]
>
> Nicolas Lalevée commented on LUCENE-755:
> ----------------------------------------
>
> Grant>
> The patch I have proposed here has no dependency on LUCENE-662; I
> just "imported" some ideas from it and put them there. Since
> LUCENE-662 has evolved, the patches will probably conflict.
> The best one to use here is Michael's. I think it won't conflict
> with LUCENE-662. And if both are intended to be committed, then the
> best approach is to commit both separately and redo the work I have
> done with the provided patch (I remember that it was quite easy).

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/




Flexible indexing (was: Re: Commented: (LUCENE-755) Payloads)

Michael Busch
Hi Grant,

LUCENE-662 contains different ideas:
1) introduction of an index format concept
2) extensibility of the store reader/writer
3) New: extensibility of the posting reader/writer

IMO we should split this up; that way it will be easier to develop
smaller patches that focus on adding one particular feature. However, it
is important to plan the API so that different patches (like payloads)
fit in. On the other hand, it will be nearly impossible to plan an API
that is perfect and won't change anymore without having the actual
implementations. Therefore I suggest the following steps:
a) define the different work items of flexible indexing
b) roughly plan an API that fits all items
c) develop the different items and commit them, but with APIs that are
either protected or marked as experimental
d) after all items are completed and committed (and hopefully tested by
some brave community members ;)), finalize the API and remove the
experimental comments (or make the APIs public)

Let's start with a):

The following items come to my mind (please feel free to
add/remove/complain):
- Introduce index-level metadata, preferably in XML format so that it
is human-readable. Later on, we can store information about the index
format in this file, like the codecs that are used to store the data. We
should also make this public, so that users can store their own index
metadata. (Remark: LUCENE-783 is also a neat idea; we can write one XML
parser for both items.)

- Introduce an index format. Nicolas has already written a lot of code
in this regard! It will include different interfaces for the different
extension points (FieldsFormat, PostingFormat, DictionaryFormat). We can
use the XML file to store which actual formats are used in the
corresponding index.

- Implement the different extensions. LUCENE-662 includes an extensible
FieldsWriter, LUCENE-755 the payloads feature. Doug and Ning have
already suggested nice interfaces for PostingFormat and DictionaryFormat
in the payloads thread on java-dev.

- Write standard implementations for the different formats. The wiki
already has a list of desired posting formats.


I suggest we finalize this list first. Then I will add it to the wiki
under Flexible indexing and gather information from the different
discussions on java-dev that I already mentioned. Then we should
discuss the items of this list in greater depth and plan the APIs
(step b)). After that we're ready for step c) and the fun starts :-).

Michael


Grant Ingersoll wrote:

> I think it makes the most sense to get flexible indexing in first, and
> then make payloads work with it.  On the other hand, payloads looked
> pretty straightforward to me, whereas FI is much more involved (or at
> least it feels that way).
>
> As it is right now, I would like to at least review the two patches
> and start thinking about them in greater depth.  The payloads patch
> needs a little more work in that I want to integrate it with the
> Similarity class so people can customize their scoring.
>
> -Grant



Re: Flexible indexing (was: Re: Commented: (LUCENE-755) Payloads)

Grant Ingersoll-4
Hi Michael,

This is very good.  I know 662 is different; I just wasn't sure whether
Nicolas' patch was meant to be applied after 662, b/c I know we had
discussed this before.

I do agree with you about planning this out, but I also know that  
patches seem to motivate people the best and provide a certain  
concreteness to it all.  I mostly started asking questions on these  
two issues b/c I wanted to spur some more discussion and see if we  
can get people motivated to move on it.

I was hoping that I would be able to apply each patch to two  
different checkouts so I could start seeing where the overlap is and  
how they could fit together (I also admit I was procrastinating on my  
ApacheCon talk...).  In the new, flexible world, the payloads  
implementation could be a separate implementation of the indexing or  
it could be part of the core/existing file format implementation.  
Sometimes I just need to get my hands on the code to get a real feel  
for what I feel is the best way to do it.

I agree about the XML storage for index information.  We do that in
our in-house wrapper around Lucene, storing info about the language,
analyzer used, etc.  We may also want a binary index-level storage
capability.  I know most people usually just create a single document
to store binary info about the index, but binary storage might be
good too.

Part of me says to apply the Payloads patch now, as it provides a lot  
of bang for the buck and I think the FI is going to take a lot longer  
to hash out.  However, I know that it may pin us in or force us to  
change things for FI.  Ultimately, I would love to see both these  
features for the next release, but that isn't a requirement.  Also,  
on FI, I would love to see two different implementations of whatever  
API we choose before releasing it, as I always find two  
implementations of an Interface really work out the API details.

-Grant


On Mar 10, 2007, at 6:27 PM, Michael Busch wrote:

> Hi Grant,
>
> LUCENE-662 contains different ideas:
> 1) introduction of an index format concept
> 2) extensibility of the store reader/writer
> 3) New: extensibility of the posting reader/writer
>
> IMO we should split this up; that way it will be easier to develop
> smaller patches that focus on adding one particular feature.
> However, it is important to plan the API so that different patches
> (like payloads) fit in. On the other hand, it will be nearly
> impossible to plan an API that is perfect and won't change anymore
> without having the actual implementations. Therefore I suggest the
> following steps:
> a) define the different work items of flexible indexing
> b) roughly plan an API that fits all items
> c) develop the different items and commit them, but with APIs that
> are either protected or marked as experimental
> d) after all items are completed and committed (and hopefully
> tested by some brave community members ;)), finalize the API and
> remove the experimental comments (or make the APIs public)
>
> Let's start with a):
>
> The following items come to my mind (please feel free to add/remove/
> complain):
> - Introduce index-level metadata, preferably in XML format so it  
> will be human readable. Later on, we can store information about  
> the index format in this file, like the codecs that are used to  
> store the data. We should also make this public, so that users can  
> store their own index metadata. (Remark: LUCENE-783 is also a neat  
> idea; we can write one XML parser for both items.)
>
> - Introduce index format. Nicolas has already written a lot of code  
> in this regard! It will include different interfaces for the  
> different extension points (FieldsFormat, PostingFormat,  
> DictionaryFormat). We can use the xml file to store which actual  
> formats are used in the corresponding index.
>
> - Implement the different extensions. LUCENE-662 includes an  
> extensible FieldsWriter, LUCENE-755 the payloads feature. Doug and  
> Ning already suggested nice interfaces for PostingFormat and  
> DictionaryFormat in the payloads thread on java-dev.
>
> - Write standard implementations for the different formats. The  
> wiki already has a list of desired posting formats.
>
>
> I suggest we should finalize this list first. Then I will add this  
> list to the wiki under Flexible indexing and gather information  
> from the different discussions on java-dev which I already  
> mentioned. Then we should discuss the different items of this list  
> in greater depth and plan the APIs (step b) ).  And then we're  
> already ready for step c) and the fun starts :-).
>
> Michael
>
>
> Grant Ingersoll wrote:
>> I think it makes the most sense to get flexible indexing in first,  
>> and then make payloads work with it.  On the other hand, payloads  
>> looked pretty straightforward to me, whereas FI is much more  
>> involved (or at least it feels that way).
>>
>> As it is right now, I would like to at least review the two  
>> patches and start thinking about them in greater depth.  The  
>> payloads patch needs a little more work in that I want to  
>> integrate it with the Similarity class so people can customize  
>> their scoring.
>>
>> -Grant
>>
>> On Mar 10, 2007, at 9:30 AM, Nicolas Lalevée (JIRA) wrote:
>>
>>>
>>>     [ https://issues.apache.org/jira/browse/LUCENE-755?
>>> page=com.atlassian.jira.plugin.system.issuetabpanels:comment-
>>> tabpanel#action_12479841 ]
>>>
>>> Nicolas Lalevée commented on LUCENE-755:
>>> ----------------------------------------
>>>
>>> Grant>
>>> The patch I have proposed here has no dependency on LUCENE-662; I  
>>> just "imported" some ideas from it and put them there. Since  
>>> LUCENE-662 has evolved, the patches will probably  
>>> conflict. The best one to use here is Michael's. I think it  
>>> won't conflict with LUCENE-662. And if both are intended to be  
>>> committed, then the best approach is to commit them separately and  
>>> redo the work I have done with the provided patch (I remember that  
>>> it was quite easy).
>>>
>>>

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/




Re: Flexible indexing

Michael Busch
Hi Grant,

I certainly agree that it would be great if we could make some progress
and commit the payloads patch soon. I think it is quite independent from
FI. FI will introduce different posting formats (see Wiki:
http://wiki.apache.org/lucene-java/FlexibleIndexing). Payloads will be
part of some of those formats, but not all (i.e., per-position payloads
only make sense if positions are stored).

The only concern some people had was about the API the patch introduces.
It extends Token and TermPositions. Doug's argument was that if we
introduce new APIs now but want to change them with FI, then it will be
hard to support those APIs. I think that is a valid point, but at the
same time it slows down progress to have to plan ahead in too many
directions. That's why I'd vote for marking the new APIs as experimental
so that people can try them out at their own risk.
If we could agree on that approach, then I'd go ahead and submit an
updated payloads patch in the next few days that applies cleanly to the
current trunk and contains the additional warnings in the javadocs.


With regard to FI and 662, however, I really believe we should split it up
and plan ahead (in the way I already mentioned), so that we have more
isolated patches. It is really great that we have 662 already (Nicolas,
thank you so much for your hard work, I hope you'll keep working with us
on FI!!). We'll probably use some of that code, and it will definitely
be helpful.

Michael




Updated: (LUCENE-755) Payloads

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch updated LUCENE-755:
---------------------------------

    Attachment: payloads.patch

I'm attaching the new patch with the following changes:
- applies cleanly on the current trunk
- fixed a bug in FSDirectory that affected payloads longer than 1024 bytes, and extended the test case TestPayloads to cover this fix
- added the following warning comments to the new APIs:

  *  Warning: The status of the Payloads feature is experimental. The APIs
  *  introduced here might change in the future and will not be supported anymore
  *  in such a case. If you want to use this feature in a production environment
  *  you should wait for an official release.


Another comment about an API change: In BufferedIndexOutput I changed the method
  protected abstract void flushBuffer(byte[] b, int len) throws IOException;
to
  protected abstract void flushBuffer(byte[] b, int offset, int len) throws IOException;

which means that existing subclasses of BufferedIndexOutput won't compile anymore. I made this change for performance reasons: if a payload is longer than 1024 bytes (the standard buffer size of BufferedIndexOutput), then it can be flushed efficiently to disk without having to perform array copies.

Is this API change acceptable? Users who have custom subclasses of BufferedIndexOutput would have to change their classes in order to work.
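
To make the performance argument concrete, here is a minimal sketch of why an offset argument on flushBuffer avoids array copies. It is not the patch's code: SketchBufferedOutput, InMemoryOutput and the copiedBytes counter are hypothetical names invented for illustration, and the checked IOException of the real signature is dropped to keep the sketch short. The assumed behaviour is that a write at least as large as the buffer is handed to flushBuffer straight from the caller's array.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical stand-in for a buffered output; NOT Lucene's BufferedIndexOutput. */
abstract class SketchBufferedOutput {
    static final int BUFFER_SIZE = 1024; // buffer size quoted in the post
    private final byte[] buffer = new byte[BUFFER_SIZE];
    private int position = 0;
    int copiedBytes = 0; // counts bytes copied into the internal buffer

    /** Writes length bytes of b, starting at offset. */
    void writeBytes(byte[] b, int offset, int length) {
        if (length >= BUFFER_SIZE) {
            // Large write: drain what is buffered, then pass the caller's
            // array through directly. Without the offset argument the data
            // would first have to be copied so that it starts at index 0.
            flush();
            flushBuffer(b, offset, length);
            return;
        }
        if (length > BUFFER_SIZE - position) {
            flush(); // make room for a small write
        }
        System.arraycopy(b, offset, buffer, position, length);
        position += length;
        copiedBytes += length;
    }

    /** Pushes any buffered bytes to the underlying sink. */
    void flush() {
        if (position > 0) {
            flushBuffer(buffer, 0, position);
            position = 0;
        }
    }

    /** The three-argument form discussed above (IOException omitted here). */
    protected abstract void flushBuffer(byte[] b, int offset, int len);
}

/** Hypothetical subclass that collects the flushed bytes in memory. */
final class InMemoryOutput extends SketchBufferedOutput {
    final ByteArrayOutputStream sink = new ByteArrayOutputStream();

    @Override
    protected void flushBuffer(byte[] b, int offset, int len) {
        sink.write(b, offset, len);
    }
}
```

In this sketch a 2000-byte payload that starts at offset 100 of a larger array reaches the sink without ever passing through the 1024-byte buffer; only small writes are copied.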

> Payloads
> --------
>
>                 Key: LUCENE-755
>                 URL: https://issues.apache.org/jira/browse/LUCENE-755
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Michael Busch
>         Assigned To: Michael Busch
>         Attachments: payload.patch, payloads.patch, payloads.patch
>
>
> This patch adds the possibility to store arbitrary metadata (payloads) together with each position of a term in its posting lists. A while ago this was discussed on the dev mailing list, where I proposed an initial design. This patch has a much improved design with modifications that make this new feature easier to use and more efficient.
> A payload is an array of bytes that can be stored inline in the ProxFile (.prx). Therefore this patch provides low-level APIs to simply store and retrieve byte arrays in the posting lists in an efficient way.
> API and Usage
> ------------------------------  
> The new class index.Payload is basically just a wrapper around a byte[] array together with int variables for offset and length. So a user does not have to create a byte array for every payload, but can rather allocate one array for all payloads of a document and provide offset and length information. This reduces object allocations on the application side.
> In order to store payloads in the posting lists one has to provide a TokenStream or TokenFilter that produces Tokens with payloads. I added the following two methods to the Token class:
>   /** Sets this Token's payload. */
>   public void setPayload(Payload payload);
>  
>   /** Returns this Token's payload. */
>   public Payload getPayload();
> In order to retrieve the data from the index the interface TermPositions now offers two new methods:
>   /** Returns the payload length of the current term position.
>    *  This is invalid until {@link #nextPosition()} is called for
>    *  the first time.
>    *
>    * @return length of the current payload in number of bytes
>    */
>   int getPayloadLength();
>  
>   /** Returns the payload data of the current term position.
>    * This is invalid until {@link #nextPosition()} is called for
>    * the first time.
>    * This method must not be called more than once after each call
>    * of {@link #nextPosition()}. However, payloads are loaded lazily,
>    * so if the payload data for the current position is not needed,
>    * this method may not be called at all for performance reasons.
>    *
>    * @param data the array into which the data of this payload is to be
>    *             stored, if it is big enough; otherwise, a new byte[] array
>    *             is allocated for this purpose.
>    * @param offset the offset in the array into which the data of this payload
>    *               is to be stored.
>    * @return a byte[] array containing the data of this payload
>    * @throws IOException
>    */
>   byte[] getPayload(byte[] data, int offset) throws IOException;
> Furthermore, this patch introduces the new method IndexOutput.writeBytes(byte[] b, int offset, int length). So far there was only a writeBytes() method without an offset argument.
> Implementation details
> ------------------------------
> - One field bit in FieldInfos is used to indicate if payloads are enabled for a field. The user does not have to enable payloads for a field; this is done automatically:
>    * The DocumentWriter enables payloads for a field, if one or more Tokens carry payloads.
>    * The SegmentMerger enables payloads for a field during a merge, if payloads are enabled for that field in one or more segments.
> - Backwards compatible: If payloads are not used, then the formats of the ProxFile and FreqFile don't change
> - Payloads are stored inline in the posting list of a term in the ProxFile. A payload of a term occurrence is stored right after its PositionDelta.
> - Same-length compression: If payloads are enabled for a field, then the PositionDelta is shifted left by one bit. The lowest bit is used to indicate whether the length of the following payload is stored explicitly. If not, i.e. the bit is not set, then the payload has the same length as the payload of the previous term occurrence.
> - In order to support skipping on the ProxFile the length of the payload at every skip point has to be known. Therefore the payload length is also stored in the skip list located in the FreqFile. Here the same-length compression is also used: The lowest bit of DocSkip is used to indicate if the payload length is stored for a SkipDatum or if the length is the same as in the last SkipDatum.
> - Payloads are loaded lazily. When a user calls TermPositions.nextPosition(), only the position and the payload length are loaded from the ProxFile. If the user calls getPayload(), then the payload is actually loaded. If getPayload() is not called before nextPosition() is called again, then the payload data is just skipped.
>  
> Changes of file formats
> ------------------------------
> - FieldInfos (.fnm)
> The format of the .fnm file does not change. The only change is the use of the sixth lowest-order bit (0x20) of the FieldBits. If this bit is set, then payloads are enabled for the corresponding field.
> - ProxFile (.prx)
> ProxFile (.prx) -->  <TermPositions>^TermCount
> TermPositions   --> <Positions>^DocFreq
> Positions       --> <PositionDelta, Payload?>^Freq
> Payload         --> <PayloadLength?, PayloadData>
> PositionDelta   --> VInt
> PayloadLength   --> VInt
> PayloadData     --> byte^PayloadLength
> For payloads disabled (unchanged):
> PositionDelta is the difference between the position of the current occurrence in the document and the previous occurrence (or zero, if this is the first occurrence in this document).
>  
> For Payloads enabled:
> PositionDelta/2 is the difference between the position of the current occurrence in the document and the previous occurrence. If PositionDelta is odd, then PayloadLength is stored. If PositionDelta is even, then the length of the current payload equals the length of the previous payload and thus PayloadLength is omitted.
> - FreqFile (.frq)
> SkipDatum     --> DocSkip, PayloadLength?, FreqSkip, ProxSkip
> PayloadLength --> VInt
> For payloads disabled (unchanged):
> DocSkip records the document number before every SkipInterval-th document in TermFreqs. Document numbers are represented as differences from the previous value in the sequence.
> For payloads enabled:
> DocSkip/2 records the document number before every SkipInterval-th document in TermFreqs. If DocSkip is odd, then PayloadLength follows. If DocSkip is even, then the length of the payload at the current skip point equals the length of the payload at the last skip point and thus PayloadLength is omitted.
> This encoding is space efficient for different use cases:
>    * If only some fields of an index have payloads, then there's no space overhead for the fields with payloads disabled.
>    * If the payloads of consecutive term positions have the same length, then the length only has to be stored once for every term. This should be a common case, because users probably use the same format for all payloads.
>    * If only a few terms of a field have payloads, then we don't waste much space because we benefit again from the same-length-compression since we only have to store the length zero for the empty payloads once per term.
> All unit tests pass.
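
The .prx layout described above can be sketched as a small round trip. This is an illustration under stated assumptions, not the patch's implementation: PayloadProxSketch, Occurrence, encode and decode are hypothetical names; the sketch covers the positions of one term in one document only; and it assumes the first occurrence always stores its length explicitly. VInt here is the usual 7-bits-per-byte encoding with the high bit as a continuation flag.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the "payloads enabled" position encoding. */
final class PayloadProxSketch {
    /** One term occurrence: absolute position plus payload bytes (may be empty). */
    static final class Occurrence {
        final int position;
        final byte[] payload;

        Occurrence(int position, byte[] payload) {
            this.position = position;
            this.payload = payload;
        }
    }

    /** Writes a non-negative int as a VInt, low-order 7 bits first. */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    /** Reads a VInt starting at cursor[0] and advances the cursor. */
    static int readVInt(byte[] data, int[] cursor) {
        byte b = data[cursor[0]++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = data[cursor[0]++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    /** Encodes <PositionDelta, PayloadLength?, PayloadData> per occurrence. */
    static byte[] encode(List<Occurrence> occurrences) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int lastPosition = 0;
        int lastLength = -1; // assumption: nothing stored yet, so the first length is explicit
        for (Occurrence o : occurrences) {
            int delta = o.position - lastPosition;
            if (o.payload.length != lastLength) {
                writeVInt(out, (delta << 1) | 1); // odd: PayloadLength follows
                writeVInt(out, o.payload.length);
                lastLength = o.payload.length;
            } else {
                writeVInt(out, delta << 1);       // even: same length as before
            }
            out.write(o.payload, 0, o.payload.length);
            lastPosition = o.position;
        }
        return out.toByteArray();
    }

    /** Decodes freq occurrences written by encode. */
    static List<Occurrence> decode(byte[] data, int freq) {
        List<Occurrence> result = new ArrayList<>();
        int[] cursor = {0};
        int position = 0;
        int length = 0;
        for (int i = 0; i < freq; i++) {
            int code = readVInt(data, cursor);
            position += code >>> 1;               // PositionDelta/2 is the real delta
            if ((code & 1) != 0) {
                length = readVInt(data, cursor);  // explicit PayloadLength
            }
            byte[] payload = new byte[length];
            System.arraycopy(data, cursor[0], payload, 0, length);
            cursor[0] += length;
            result.add(new Occurrence(position, payload));
        }
        return result;
    }
}
```

With payloads of lengths 2, 2 and 1, only the first and third occurrences carry a PayloadLength; the second reuses the previous length, which is where the space saving for uniform payloads comes from.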


Re: Flexible indexing

Grant Ingersoll-4
In reply to this post by Michael Busch

On Mar 11, 2007, at 5:41 PM, Michael Busch wrote:

> Hi Grant,
>
> I certainly agree that it would be great if we could make some  
> progress and commit the payloads patch soon. I think it is quite  
> independent from FI. FI will introduce different posting formats  
> (see Wiki: http://wiki.apache.org/lucene-java/FlexibleIndexing).  
> Payloads will be part of some of those formats, but not all (i. e.  
> per-position payloads only make sense if positions are stored).
>

Yep, I agree.

> The only concern some people had was about the API the patch  
> introduces. It extends Token and TermPositions. Doug's argument  
> was that if we introduce new APIs now but want to change them with  
> FI, then it will be hard to support those APIs. I think that is a  
> valid point, but at the same time it slows down progress to have to  
> plan ahead in too many directions. That's why I'd vote for marking  
> the new APIs as experimental so that people can try them out at  
> their own risk.
> If we could agree on that approach, then I'd go ahead and submit an  
> updated payloads patch in the next few days that applies cleanly to  
> the current trunk and contains the additional warnings in the  
> javadocs.
>

+1.

>
> With regard to FI and 662, however, I really believe we should split  
> it up and plan ahead (in the way I already mentioned), so that we  
> have more isolated patches. It is really great that we have 662  
> already (Nicolas, thank you so much for your hard work, I hope  
> you'll keep working with us on FI!!). We'll probably use some of  
> that code, and it will definitely be helpful.
>

+1  I think this makes a lot of sense.  We have been deliberating  
these changes for some time, so no reason to hurry.  I don't think  
they are urgent, yet they really will give us more flexibility and  
more capabilities for more people, so it will be a good thing to have.



------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/




Re: Updated: (LUCENE-755) Payloads

Grant Ingersoll-4
In reply to this post by JIRA jira@apache.org
Cool.  I will try to take a look at it tomorrow.  Since we have the  
lazy SegTermPos thing in now, we should be able to integrate this  
into scoring via the Similarity and merge TermDocs and TermPositions  
like you suggested.

If I can get the Scoring piece in and people are fine w/ the  
flushBuffer change then hopefully we can get this in this week.  I  
will try to post a patch that includes your patch and the scoring  
integration by tomorrow or Tuesday if that is fine with you.

-Grant

On Mar 11, 2007, at 8:35 PM, Michael Busch (JIRA) wrote:

>
>      [ https://issues.apache.org/jira/browse/LUCENE-755?
> page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Michael Busch updated LUCENE-755:
> ---------------------------------
>
>     Attachment: payloads.patch
>
> I'm attaching the new patch with the following changes:
> - applies cleanly on the current trunk
> - fixed a bug in FSDirectory which affected payloads with length  
> greater than 1024 bytes and extended testcase TestPayloads to test  
> this fix
> - added the following warning comments to the new APIs:
>
>   *  Warning: The status of the Payloads feature is experimental.  
> The APIs
>   *  introduced here might change in the future and will not be  
> supported anymore
>   *  in such a case. If you want to use this feature in a  
> production environment
>   *  you should wait for an official release.
>
>
> Another comment about an API change: In BufferedIndexOutput I
> changed the method
>   protected abstract void flushBuffer(byte[] b, int len) throws IOException;
> to
>   protected abstract void flushBuffer(byte[] b, int offset, int len) throws IOException;
>
> which means that subclasses of BufferedIndexOutput won't compile  
> anymore. I made this change for performance reasons: If a payload  
> is longer than 1024 bytes (standard buffer size of  
> BufferedIndexOutput) then it can be flushed efficiently to disk  
> without having to perform array copies.
>
> Is this API change acceptable? Users who have custom subclasses of  
> BufferedIndexOutput would have to change their classes in order to  
> work.
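
To make the flushBuffer change quoted above concrete, here is a minimal sketch of why the offset argument helps and what a custom subclass would have to change. It does not use Lucene's classes: SketchBufferedOutput and SketchMemoryOutput are hypothetical stand-ins, and the "bypass the buffer for large writes" logic is an assumption about how the patch benefits from the new signature, not a copy of it.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

// Hypothetical stand-in for BufferedIndexOutput; only the parts needed here.
abstract class SketchBufferedOutput {
    static final int BUFFER_SIZE = 1024; // buffer size named in the message
    private final byte[] buffer = new byte[BUFFER_SIZE];
    private int position = 0;

    void writeBytes(byte[] b, int offset, int length) throws IOException {
        if (length > BUFFER_SIZE) {
            // Too large to buffer: drain pending bytes, then hand the caller's
            // array straight to the sink, with no intermediate array copy.
            flush();
            flushBuffer(b, offset, length);
            return;
        }
        if (length > BUFFER_SIZE - position) {
            flush(); // make room for the incoming bytes
        }
        System.arraycopy(b, offset, buffer, position, length);
        position += length;
    }

    void flush() throws IOException {
        flushBuffer(buffer, 0, position);
        position = 0;
    }

    // Three-argument form, as proposed in the patch.
    protected abstract void flushBuffer(byte[] b, int offset, int len) throws IOException;
}

// A custom subclass only has to honour the extra offset argument.
class SketchMemoryOutput extends SketchBufferedOutput {
    final ByteArrayOutputStream sink = new ByteArrayOutputStream();
    int flushCalls = 0;

    @Override
    protected void flushBuffer(byte[] b, int offset, int len) throws IOException {
        flushCalls++;
        sink.write(b, offset, len);
    }
}
```

With the old two-argument form, a large write starting at a non-zero offset would first have to be copied into an array that starts at index 0, which is the array copy the message refers to.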
>
>> Payloads
>> --------
>>
>>                 Key: LUCENE-755
>>                 URL: https://issues.apache.org/jira/browse/LUCENE-755
>>             Project: Lucene - Java
>>          Issue Type: New Feature
>>          Components: Index
>>            Reporter: Michael Busch
>>         Assigned To: Michael Busch
>>         Attachments: payload.patch, payloads.patch, payloads.patch
>>

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/




Re: Updated: (LUCENE-755) Payloads

Michael Busch
Grant Ingersoll wrote:

> Cool.  I will try and take a look at it tomorrow.  Since we have the
> lazy SegTermPos thing in now, we should be able to integrate this into
> scoring via the Similarity and merge TermDocs and TermPositions like
> you suggested.
>
> If I can get the Scoring piece in and people are fine w/ the
> flushBuffer change then hopefully we can get this in this week.  I
> will try to post a patch that includes your patch and the scoring
> integration by tomorrow or Tuesday if that is fine with you.
>
I'm not completely sure how you want to integrate this into the Similarity
class. Payloads can be used for more than scoring. Consider XML search,
for example: payloads can store the element in which a term occurs.
During search (e.g. an XPath query) the payloads would then be used to
find hits, not to score them.

On the other hand, if you want to store e.g. per-position boosts in the
payloads, you could use the norm en/decoding methods that are already in
Similarity. You could use the following code in a TokenStream:
  byte[] payload = new byte[1];
  payload[0] = Similarity.encodeNorm(boost);
  // setPayload() takes a Payload, which wraps a byte[] plus offset and length
  token.setPayload(new Payload(payload, 0, payload.length));

and in a scorer you could then get the boost with:
  // getPayload(byte[] data, int offset) returns the array holding the data
  payloadBuffer = termPositions.getPayload(payloadBuffer, 0);
  float boost = Similarity.decodeNorm(payloadBuffer[0]);
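
One thing to keep in mind with this approach is that a single byte leaves very little precision for the boost. The sketch below is a self-contained illustration of a one-byte float code with 3 mantissa bits and 5 exponent bits. It is not Lucene's code: OneByteBoost and its methods are made-up names, and the bit layout is an assumption about how such an encoding can work.

```java
// Hypothetical helper, not Lucene's Similarity: a one-byte float code with
// 3 mantissa bits and 5 exponent bits, so precision is coarse by design.
final class OneByteBoost {
    private OneByteBoost() {}

    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3);      // keep sign, exponent, top 3 mantissa bits
        int zeroPoint = (63 - 15) << 3;    // everything at or below this is too small
        if (small <= zeroPoint) {
            // Zero and negative values map to 0; tiny positives map to the
            // smallest non-zero code.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= zeroPoint + 0x100) {
            return -1;                     // overflow: largest code (0xFF)
        }
        return (byte) (small - zeroPoint);
    }

    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xFF) << (24 - 3); // restore exponent and mantissa bits
        bits += (63 - 15) << 24;           // undo the exponent shift
        return Float.intBitsToFloat(bits);
    }
}
```

Under this layout 1.0f, 1.5f and 2.0f survive a round trip unchanged, while 1.1f decodes back as 1.0f, so boosts that differ only slightly may become indistinguishable.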

But maybe you have something different in mind? Could you elaborate, please?


Re: Flexible indexing

Michael Busch
In reply to this post by Grant Ingersoll-4
Grant Ingersoll wrote:

>
>> With regard to FI and 662, however, I really believe we should split it
>> up and plan ahead (in a way I mentioned already), so that we have
>> more isolated patches. It is really great that we have 662 already
>> (Nicolas, thank you so much for your hard work, I hope you'll keep
>> working with us on FI!!). We'll probably use some of that code, and
>> it will definitely be helpful.
>>
>
> +1  I think this makes a lot of sense.  We have been deliberating
> these changes for some time, so no reason to hurry.  I don't think
> they are urgent, yet they really will give us more flexibility and
> more capabilities for more people, so it will be a good thing to have.
>

Right, we don't have to hurry. But still it would be cool to have some
of the FI features in the next release and once we start (now!) we
should try to keep the momentum going!


Re: Updated: (LUCENE-755) Payloads

Grant Ingersoll-2
In reply to this post by Michael Busch
I haven't looked at your latest patch yet, so this is just guesswork,
but I was thinking that in TermScorer, around line 75 or so, we could add:

score *= similarity.scorePayload(payloadBuffer);

The default Similarity would just return 1.  This would allow people
to incorporate a score based on what is in the payload, per their
application needs, and would be completely backward-compatible.  We
may even want to postpone decoding the payload until inside the
Similarity for performance reasons, but that should be tested, since
it could confuse people overriding Similarity.
I will have to look at some of the other Scorers to see if there is a
way to incorporate this into some of them.
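
As a rough sketch of the hook described here, using stand-in classes rather than Lucene's Similarity: SketchSimilarity, BoostPayloadSimilarity, the extra length parameter and the "quarter steps" payload layout are all assumptions made for illustration.

```java
// Hypothetical stand-in for Similarity with the proposed hook.
class SketchSimilarity {
    /** Default behaviour: payloads do not influence the score. */
    float scorePayload(byte[] payloadBuffer, int length) {
        return 1.0f;
    }
}

// An application-specific override that reads a boost from the payload.
class BoostPayloadSimilarity extends SketchSimilarity {
    @Override
    float scorePayload(byte[] payloadBuffer, int length) {
        if (length == 0) {
            return 1.0f; // no payload stored: stay neutral
        }
        // Assumed layout: first byte is a boost in quarter steps (4 means 1.0).
        return (payloadBuffer[0] & 0xFF) / 4.0f;
    }
}

final class SketchPayloadScoring {
    private SketchPayloadScoring() {}

    /** Mirrors the proposed line: score *= similarity.scorePayload(...). */
    static float apply(float score, SketchSimilarity similarity,
                       byte[] payloadBuffer, int length) {
        return score * similarity.scorePayload(payloadBuffer, length);
    }
}
```

Because the default returns 1, existing scores stay unchanged unless an application supplies its own subclass.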

None of this would prevent using payloads for other things as well,  
such as the XPath query example.

Doing this would involve switching over to using TermPositions like  
we talked about.  Like I said, I will take a look at it and see if  
anything resonates.

-Grant

On Mar 11, 2007, at 11:26 PM, Michael Busch wrote:

> Grant Ingersoll wrote:
>> Cool.  I will try and take a look at it tomorrow.  Since we have  
>> the lazy SegTermPos thing in now, we should be able to integrate  
>> this into scoring via the Similarity and merge TermDocs and  
>> TermPositions like you suggested.
>>
>> If I can get the Scoring piece in and people are fine w/ the  
>> flushBuffer change then hopefully we can get this in this week.  I  
>> will try to post a patch that includes your patch and the scoring  
>> integration by tomorrow or Tuesday if that is fine with you.
>>
> I'm not completely sure how you want to integrate this into the
> Similarity class. Payloads can be used for more than scoring.
> Consider XML search, for example: payloads can store the element in
> which a term occurs. During search (e.g. an XPath query) the payloads
> would then be used to find hits, not to score them.
>
> On the other hand, if you want to store e.g. per-position boosts in
> the payloads, you could use the norm en/decoding methods that are
> already in Similarity. You could use the following code in a
> TokenStream:
>  byte[] payload = new byte[1];
>  payload[0] = Similarity.encodeNorm(boost);
>  // setPayload() takes a Payload, which wraps a byte[] plus offset and length
>  token.setPayload(new Payload(payload, 0, payload.length));
>
> and in a scorer you could then get the boost with:
>  // getPayload(byte[] data, int offset) returns the array holding the data
>  payloadBuffer = termPositions.getPayload(payloadBuffer, 0);
>  float boost = Similarity.decodeNorm(payloadBuffer[0]);
>
> But maybe you have something different in mind? Could you  
> elaborate, please?
>

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ




Re: Updated: (LUCENE-755) Payloads

Michael Busch
Grant Ingersoll wrote:
> I haven't looked at your latest patch yet, so this is just guesswork,
> but I was thinking that in TermScorer, around line 75 or so, we could add:
>
> score *= similarity.scorePayload(payloadBuffer);
>
TermScorer currently doesn't iterate over the positions. It uses a
buffer to load 32 doc/freq pairs at a time from TermDocs using the read()
method. In order to use per-term boosts you would have to change TermScorer
to stop using the buffer and call TermDocs.next() instead. Then you
can iterate over the positions and get the payloads. This is a
significant change to TermScorer, and performance would probably suffer
for indexes that don't have payloads. I admit that I had the same idea
in mind (I mentioned it in LUCENE-761), but after looking more closely
at TermScorer I changed my mind.

I believe the better option is to create a new scorer subclass like
WeightedTermScorer which should be used if payloads containing per-term
boosts are stored in the index.
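
A minimal sketch of the difference between the two scorers, using plain arrays instead of TermDocs/TermPositions. Everything here is hypothetical: the class and method names, the one-byte payload layout (boost = value / 16) and the square-root damping are assumptions chosen only to show why the weighted variant has to visit every position while the plain one needs just the frequency.

```java
// Hypothetical sketch, not Lucene code: all names are invented for illustration.
final class SketchWeightedTermScorer {
    private SketchWeightedTermScorer() {}

    /** Assumed payload layout: one byte per position, boost = value / 16. */
    static float decodeBoost(byte payload) {
        return (payload & 0xFF) / 16.0f;
    }

    /** Plain term scoring: only the number of occurrences matters. */
    static float plainScore(int freq) {
        return (float) Math.sqrt(freq);
    }

    /** Weighted scoring: walk every position and add up its boost. */
    static float weightedScore(byte[] payloadPerPosition) {
        float weightedFreq = 0.0f;
        for (byte payload : payloadPerPosition) {
            weightedFreq += decodeBoost(payload);
        }
        return (float) Math.sqrt(weightedFreq);
    }
}
```

When every position carries a neutral boost the two functions agree, which is why keeping the position-walking logic in a separate scorer leaves payload-free indexes on the cheaper path.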

- Michael



Re: Flexible indexing

Marvin Humphrey
In reply to this post by Michael Busch

On Mar 10, 2007, at 3:27 PM, Michael Busch wrote:

I'm going to respond to this over several mails (: and possibly  
days :) because there's an awful lot here, and I've already  
implemented a lot of it in KS.

> We should also make this public, so that users can store their own  
> index metadata.
> (Remark: LUCENE-783 is also a neat idea, we can write one xml  
> parser for both items)

There's a significant downside to allowing users to store arbitrary  
data in an XML index file: you can't use a bare-bones parser, hand-
coded for a tiny, controlled subset of XML syntax and a limited set  
of data structures.  You'd need a full-on XML encoder/decoder,  
presumably an existing one that would be added as a dependency.

The only reason that KinoSearch's YAML codec requires only 600  
lines of C is that it's a closed system.  No multi-line strings.  No  
objects.  No nulls.  You get the picture.
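
To illustrate the "closed system" point, here is a sketch of how small a parser can stay when the format is restricted to flat "key: value" lines and anything else is rejected. This is not KinoSearch's codec (which is written in C and handles a YAML subset); FlatMetadataParser and the chosen syntax are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser for a deliberately tiny format: one "key: value" per line.
final class FlatMetadataParser {
    private FlatMetadataParser() {}

    static Map<String, String> parse(String text) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String line : text.split("\n")) {
            if (line.isEmpty()) {
                continue; // blank lines are the only tolerated extra
            }
            int colon = line.indexOf(": ");
            if (colon <= 0) {
                // No lists, nesting or multi-line strings: fail fast instead.
                throw new IllegalArgumentException("unsupported syntax: " + line);
            }
            result.put(line.substring(0, colon), line.substring(colon + 2));
        }
        return result;
    }
}
```

Once users can store arbitrary data, the rejected cases above have to be supported, and that is where a full encoder/decoder becomes necessary.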

Is there anything you're envisioning that can't be done using a  
wrapper class and auxiliary/external files?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


