Field compression and storage optimization

Mike Klaas
Hi,

I was thinking of enabling the compressed=True field option (which
currently has no effect), as compression is important for highlighting
large fields (since they must be stored).

However, rather than exposing a Lucene implementation detail, I
decided to create a FieldType which dynamically chooses to compress
and/or term-vector a field depending on the field length (configurable
in the field type).

Any objections to committing this?

-Mike

import java.util.Map;

import org.apache.lucene.document.Field;
// SolrParams lived in org.apache.solr.request at the time of this thread
import org.apache.solr.request.MapSolrParams;
import org.apache.solr.request.SolrParams;
import org.apache.solr.schema.IndexSchema;
import org.apache.solr.schema.SchemaField;
import org.apache.solr.schema.TextField;

public class HighlitTextField extends TextField {
  /* if field size (in characters) is at least this threshold, the field
     will be stored compressed */
  public static final int DEFAULT_COMPRESS_THRESHOLD = 200;
  /* if field size (in characters) is at least this threshold, the field
     will have term vector data stored */
  public static final int DEFAULT_TERMVEC_THRESHOLD = 500;

  int compressThreshold;
  int termVecThreshold;

  private static final String CT = "compressThreshold";
  private static final String TV = "termVecThreshold";

  protected void init(IndexSchema schema, Map<String,String> args) {
    SolrParams p = new MapSolrParams(args);
    compressThreshold = p.getInt(CT, DEFAULT_COMPRESS_THRESHOLD);
    termVecThreshold = p.getInt(TV, DEFAULT_TERMVEC_THRESHOLD);
    /* remove our own options so the superclass does not see unknown args */
    for (String prop : new String[]{CT, TV})
      args.remove(prop);
    super.init(schema, args);
  }

  /* Helpers for field construction */
  protected Field.TermVector getFieldTermVec(SchemaField field,
                                             String internalVal) {
    /* store all termvec data if field length reaches threshold */
    return internalVal.length() >= termVecThreshold ?
      Field.TermVector.WITH_POSITIONS_OFFSETS : Field.TermVector.NO;
  }

  protected Field.Store getFieldStore(SchemaField field,
                                      String internalVal) {
    /* compress field if length reaches threshold */
    return internalVal.length() >= compressThreshold ?
      Field.Store.COMPRESS : Field.Store.YES;
  }
}

Re: Field compression and storage optimization

Yonik Seeley-2
A couple of thoughts...
 - should this be specific to highlighting? (if not, the name should change)
 - compression options make sense for both text and string fields...
perhaps it should just be added there.
 - if you store term vectors for longer fields, shouldn't you just
store them for all fields (the longer ones will presumably take up the
bulk of the index anyway)

Regarding term vectors... like some other field properties, they are
per-field and not per-field-instance (so you can't turn it on for some
and off for others).  On document retrieval, I think one would detect
that term vectors were stored, but one wouldn't get back any terms (I
haven't tried this though).  I doubt the highlighter handles this
case.

-Yonik

On 8/31/06, Mike Klaas <[hidden email]> wrote:

> Hi,
>
> I was thinking of enabling the compressed=True field option (which
> currently has no effect), as compression is important for highlighting
> large fields (since they must be stored).
>
> However, rather than exposing a Lucene implementation detail, I
> decided to create a FieldType which dynamically chooses to compress
> and/or term-vector a field depending on the field length (configurable
> in the field type).
>
> Any objections to committing this?
>
> -Mike

Re: Field compression and storage optimization

Mike Klaas
On 9/1/06, Yonik Seeley <[hidden email]> wrote:
> A couple of thoughts...
>  - should this be specific to highlighting? (if not, the name should change)

Not necessarily--the reason I named it as such is that I had trouble
thinking of applications of only-sometimes storing term vectors for a
field.  Though since I've misunderstood how term vectors work, the
only thing that remains is compression, which is more generally
applicable.

>  - compression options make sense for both text and string fields...
> perhaps it should just be added there.

That sounds ideal.  Perhaps a compressed=true/false with optional
compressionThreshold (default compress all)?
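
A minimal sketch of how such an option pair could decide the storage mode. The StoreMode enum is a hypothetical stand-in for Lucene's Field.Store values, and treating a threshold of 0 as "compress all" is an assumption for illustration, not something settled in this thread:

```java
/** Hypothetical stand-in for the Field.Store.YES / Field.Store.COMPRESS choice. */
enum StoreMode { YES, COMPRESS }

final class CompressionChoice {
  /**
   * Picks a storage mode for one field value.
   * compressed=false never compresses; otherwise values whose length is
   * at least compressionThreshold are compressed (0 compresses everything).
   */
  static StoreMode choose(boolean compressed, int compressionThreshold, String value) {
    if (!compressed) {
      return StoreMode.YES;
    }
    return value.length() >= compressionThreshold ? StoreMode.COMPRESS : StoreMode.YES;
  }

  public static void main(String[] args) {
    System.out.println(choose(true, 0, "short"));    // threshold 0: always compress
    System.out.println(choose(true, 200, "short"));  // below threshold: stored as-is
    System.out.println(choose(false, 0, "short"));   // compression disabled
  }
}
```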

Should these types of parameters be overridable at the
field-definition level?  It is a bit difficult since field properties
are boolean and there would have to be some means of determining
whether a field property is set or not.
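
One possible way to tell "not set" apart from "set to false" is to carry a second bitmask that records which properties the field definition mentioned explicitly. The sketch below assumes properties are bit flags; the class name, flag constants and resolve method are all hypothetical:

```java
final class TriStateProps {
  // Hypothetical property bits, for illustration only.
  static final int COMPRESSED = 1;
  static final int TERM_VECTORS = 1 << 1;

  /**
   * Combines field-type defaults with field-level overrides.
   * Bits present in fieldSetMask take their value from fieldValues;
   * every other bit falls through to typeDefaults.
   */
  static int resolve(int typeDefaults, int fieldSetMask, int fieldValues) {
    return (typeDefaults & ~fieldSetMask) | (fieldValues & fieldSetMask);
  }

  public static void main(String[] args) {
    // The type says compressed=true; the field explicitly says compressed=false.
    int effective = resolve(COMPRESSED, COMPRESSED, 0);
    System.out.println((effective & COMPRESSED) != 0);
  }
}
```

With this layout an unset property costs nothing extra at lookup time, since the override is resolved once when the schema is read.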

>  - if you store term vectors for longer fields, shouldn't you just
> store them for all fields (the longer ones will presumably take up the
> bulk of the index anyway)

True, it might make more sense to reverse the inequality.

> Regarding term vectors... like some other field properties, they are
> per-field and not per-field-instance (so you can't turn it on for some
> and off for others).  On document retrieval, I think one would detect
> that term vectors were stored, but one wouldn't get back any terms (I
> haven't tried this though).  I doubt the highlighter handles this
> case.

If they are per-field, does that mean that term-vectors are generated
for all documents for a field if only one document requests them?  If
so, there is little point to this optimization.

If not, however, the highlighting code currently works by attempting
term-vector retrieval and falling back on re-analysis, so I believe
that it should be fine.
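
The fallback described here can be sketched roughly as follows. This is not the actual highlighter code: the list of terms stands in for a retrieved term vector, and the whitespace splitter stands in for a real analyzer; both are assumptions made to keep the example self-contained.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

final class HighlightTerms {
  /**
   * Returns the stored term vector when one was retrieved (non-null),
   * otherwise re-analyzes the stored text.
   */
  static List<String> termsFor(List<String> storedVector,
                               String storedText,
                               Function<String, List<String>> analyzer) {
    if (storedVector != null) {
      return storedVector;
    }
    return analyzer.apply(storedText);
  }

  /** Toy analyzer: splits on whitespace. */
  static List<String> whitespaceAnalyze(String text) {
    return Arrays.asList(text.trim().split("\\s+"));
  }

  public static void main(String[] args) {
    System.out.println(termsFor(null, "quick brown fox", HighlightTerms::whitespaceAnalyze));
  }
}
```

Note that the sketch falls back only on null. A non-null but empty vector, the case Yonik describes above, would be returned as-is and no re-analysis would happen.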

-Mike


Re: Field compression and storage optimization

Yonik Seeley-2
On 9/1/06, Mike Klaas <[hidden email]> wrote:

> If they are per-field, does that mean that term-vectors are generated
> for all documents for a field if only one document requests them?  If
> so, there is little point to this optimization.

No, term vectors are only generated for fields where it is explicitly set.
But the per-segment FieldInfos keeps track of "indexed", "omitNorms"
and "termVectors" on a per-fieldname basis.  When segments are merged,
if one segment doesn't have termvectors stored and another segment
does, the entire new segment is marked as having termvectors.
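
As a rough model of that merge behaviour (a simplified sketch, not Lucene's actual FieldInfos code), the per-fieldname flag can be thought of as being OR-ed across the segments being merged. The map-based representation and the class name are assumptions for illustration:

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class FieldFlagsMerge {
  /**
   * Each map is a hypothetical per-segment view: field name -> "termVectors" flag.
   * A field ends up flagged in the merged segment if either input flagged it.
   */
  static Map<String, Boolean> merge(Map<String, Boolean> a, Map<String, Boolean> b) {
    Map<String, Boolean> merged = new LinkedHashMap<>(a);
    for (Map.Entry<String, Boolean> e : b.entrySet()) {
      merged.merge(e.getKey(), e.getValue(), Boolean::logicalOr);
    }
    return merged;
  }

  public static void main(String[] args) {
    Map<String, Boolean> seg1 = new LinkedHashMap<>();
    seg1.put("body", false);  // no document in this segment stored term vectors
    Map<String, Boolean> seg2 = new LinkedHashMap<>();
    seg2.put("body", true);   // at least one document did
    System.out.println(merge(seg1, seg2));
  }
}
```

Under this model, documents from the first segment sit in a merged segment whose "body" field is flagged as having term vectors, even though none were stored for them.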

> If not, however, the highlighting code currently works by attempting
> term-vector retrieval and falling back on re-analysis, so I believe
> that it should be fine.

I haven't tried it, but I think it might be impossible to tell an
empty field with termvectors stored from a field without termvectors
that was "promoted" to having termvectors.  I think
reader.getTermFreqVector(docId,field) may *not* return null in the
latter case.
Anyone know for sure?

-Yonik