VInt's as prefix. Was: bytecount as prefix

VInt's as prefix. Was: bytecount as prefix

Ben van Klinken
Hi,

I'm the author of CLucene (a c++ port of lucene). I've been following
the 'using byte count as prefix' discussion and I think this
discussion sort of ties into something we are trying to achieve.

We are trying to optimise the way the index writing works, and we also
want to be able to index & store fields which are using a Reader
object.

The second part is, in theory, easy to solve: we can use a stream
filter to buffer the reads that the analyser makes, and integrate the
FieldsWriter into the invertDocument function so that the buffers are
written while the analysers run. Since there is no way of knowing the
length of the reader in advance, we would then have to go back and
write the field length. Here is where the problem is, though: this
is not possible currently because we are using a VInt for the field
data length. The number of bytes a VInt occupies depends on its value,
so we cannot reserve the right amount of space before the data is read.

If we can use fixed-length integers for the field data length,
two things become much easier:

1) memory optimisations like the compressed field can benefit from
this: we don't have to store the entire compressed output in memory,
but can rather write it directly to the fields output stream.
2) it makes it possible to store AND index a field using a reader in a
single pass, thus removing the need to read twice (which might not
always be possible for some reader implementations).

The second feature is very important for us!
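
To illustrate the single-pass idea, here is a minimal Java sketch. It is not CLucene or Lucene code: the class and method names are made up for this example, a RandomAccessFile stands in for the fields output stream, and characters are written as plain 2-byte values only to keep the example short. It reserves a fixed 4-byte length, streams the reader once, then seeks back and patches the length.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class BackpatchLengthSketch {
    /** Streams all of reader to out, prefixed by a fixed 4-byte char count. */
    static int writeField(RandomAccessFile out, Reader reader) throws IOException {
        long lengthPos = out.getFilePointer();
        out.writeInt(0);                  // placeholder, patched once the length is known
        char[] buf = new char[1024];
        int total = 0;
        int n;
        while ((n = reader.read(buf)) != -1) {
            for (int i = 0; i < n; i++) {
                out.writeChar(buf[i]);    // simplistic 2-byte encoding, illustration only
            }
            total += n;
        }
        long endPos = out.getFilePointer();
        out.seek(lengthPos);
        out.writeInt(total);              // always 4 bytes, so the data after it never moves
        out.seek(endPos);
        return total;
    }

    /** Writes text to a temp file via writeField and returns the length prefix read back. */
    static int roundTrip(String text) {
        try {
            File f = File.createTempFile("fields", ".bin");
            f.deleteOnExit();
            try (RandomAccessFile out = new RandomAccessFile(f, "rw")) {
                writeField(out, new StringReader(text));
                out.seek(0);
                return out.readInt();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("hello world"));
    }
}
```

With a VInt prefix the same patch step would not work, because the placeholder might turn out to be shorter or longer than the final encoded length.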

So I would like to propose a discussion on how this could be achieved:

My idea is to set a bit in the field's flag byte, like
FIELD_DONT_USE_VINT. I don't think using a fixed-width Int for every
field is necessary; those few extra (unnecessary) bytes per field
would add up to a lot. A fixed-width Int would be used only when
strictly necessary, and the implementation could decide when to use it.
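
As a rough illustration of the space trade-off, here is a small Java sketch of the writer side. The flag name comes from the proposal above, but its bit value (0x8), the class name and the helper methods are assumptions made for this example, and a ByteArrayOutputStream stands in for the real output stream.

```java
import java.io.ByteArrayOutputStream;

class LengthPrefixSketch {
    static final int FIELD_DONT_USE_VINT = 0x8; // hypothetical bit value

    /** Existing VInt layout: low-order 7-bit group first, 0x80 marks "more bytes follow". */
    static void writeVInt(ByteArrayOutputStream out, int i) {
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.write(i);
    }

    /** Fixed-width big-endian int: always 4 bytes, whatever the value. */
    static void writeInt(ByteArrayOutputStream out, int i) {
        out.write(i >>> 24);
        out.write(i >>> 16);
        out.write(i >>> 8);
        out.write(i);
    }

    /** Writes the flag byte followed by the length in the format the flag selects. */
    static byte[] writeLength(int bits, int length) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(bits);
        if ((bits & FIELD_DONT_USE_VINT) != 0) {
            writeInt(out, length);
        } else {
            writeVInt(out, length);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(writeLength(0, 5).length);
        System.out.println(writeLength(FIELD_DONT_USE_VINT, 5).length);
    }
}
```

For a length of 5, the VInt form takes 2 bytes including the flag byte and the fixed-width form takes 5, which is why the flag would be set only for fields that need it.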

These are the rough changes that I think would need to be made:

final Document doc(int n) throws IOException {
...
        byte bits = fieldsStream.readByte();
        // New flag: the length prefix is a fixed 4-byte int, not a VInt.
        boolean dontUseVint = (bits & FieldsWriter.FIELD_DONT_USE_VINT) != 0;
...
        // Binary and compressed fields are an easy change.
        if ((bits & FieldsWriter.FIELD_IS_BINARY) != 0) {
                final byte[] b = new byte[dontUseVint
                        ? fieldsStream.readInt()
                        : fieldsStream.readVInt()];  // CHANGE HERE
...
        if (compressed) {
                final byte[] b = new byte[dontUseVint
                        ? fieldsStream.readInt()
                        : fieldsStream.readVInt()];  // CHANGE HERE
...
        // Reading a field value as a string.
        String value;
        if (dontUseVint) {
                // I'm not completely sure about this section, since changes
                // relating to 'bytecount as prefix' would affect this.
                int length = fieldsStream.readInt();
                char[] chars = new char[length];
                fieldsStream.readChars(chars, 0, length);
                value = new String(chars, 0, length);
        } else {
                value = fieldsStream.readString();
        }

        Field f = new Field(fi.name,     // name
            value, // read value; CHANGE HERE - use different string length
            store,
            index,
            termVector);
...

Now is probably the best time to implement something like this before
lucene 2.0 is released. I think it wouldn't be a complicated change;
for now, we don't need to make any changes to the FieldWriter
(optimisations using this can be done later).

ben

On 5/7/06, Marvin Humphrey <[hidden email]> wrote:

> Got it.
>
> This was the problem, in TermInfosWriter.writeTerm():
>
> -    lastTerm = term;
> +    lastBytes = bytes;
>    }
>
> Without lastTerm being updated, the auxiliary term dictionary got
> screwed up.  This problem only manifested on large tests because small
> tests never moved past the first entry, which is always a field number
> of -1 and an empty string.
>
> I'll post a full working patch to JIRA as soon as I'm at a location
> where I can connect my laptop to the net.
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>


using jstreams

Thomas Gravel
Hi Ben,

I have now tried to compile the contributions,
but the build fails in the highlighter objects.
It fails in the Encoder object.

I am using svn revision 2063.

I tried to use jstreams, but I don't know how.
Could you give me an example of adding a field with content to a document?

string Content = "something in utf-8 text";
Document *doc= _CLNEW Document;
doc->add( *Field::Text("Content", Content.c_str(), true ) );

thx
bye
thomas


Re: VInt's as prefix. Was: bytecount as prefix

Marvin Humphrey
In reply to this post by Ben van Klinken

On May 11, 2006, at 3:24 AM, Ben van Klinken wrote:

> Here is where the problem is, though: this
> is not possible currently because we are using a VInt for the field
> data length.

What we really need is the ability to add "leading zeroes" to a VInt.

I believe that this is possible if we change the definition of VInt  
so that the high bytes are written first, rather than the low bytes.  
The "BER compressed integer", used by Perl's pack() function, is  
defined this way.  A proof-of-concept Perl script is below.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

#-------------------------------------------------
#!/usr/bin/perl
use strict;
use warnings;

my $pad        = pack( 'C', 128 );    # "leading zero": 1000 0000
my $serialized = pack( 'wwawaaw', 127, 128, $pad, 129, $pad, $pad, 154 );

my @numbers = unpack( 'w*', $serialized );
print "@numbers\n";      # prints "127 128 129 154"
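
For readers who do not use Perl, the same idea can be sketched in Java. This is an illustration only, not a proposed patch: the class and method names are invented here, and it assumes the high-byte-first layout described above, where every byte except the last carries the 0x80 continuation bit. Padding a value to a fixed width then just means emitting extra 0x80 bytes in front.

```java
import java.io.ByteArrayOutputStream;

class PaddedVIntSketch {
    /** Writes value high-order group first, left-padded with 0x80 bytes to exactly width bytes. */
    static byte[] writePadded(int value, int width) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values are not handled in this sketch");
        }
        int needed = 1;
        for (int v = value >>> 7; v != 0; v >>>= 7) {
            needed++;
        }
        if (needed > width) {
            throw new IllegalArgumentException("value does not fit in " + width + " bytes");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = width - 1; i > 0; i--) {
            int shift = 7 * i;
            // Java masks int shift distances to 5 bits, so guard shifts of 32 or more.
            int group = shift >= 32 ? 0 : (value >>> shift) & 0x7F;
            out.write(group | 0x80);      // continuation bit set; a zero group is pure padding
        }
        out.write(value & 0x7F);          // last byte: continuation bit clear
        return out.toByteArray();
    }

    /** Decodes one high-byte-first VInt starting at buf[0]; padding bytes contribute zero. */
    static int read(byte[] buf) {
        int pos = 0;
        byte b = buf[pos++];
        int i = b & 0x7F;
        while ((b & 0x80) != 0) {
            b = buf[pos++];
            i = (i << 7) | (b & 0x7F);
        }
        return i;
    }

    public static void main(String[] args) {
        byte[] padded = writePadded(154, 4);
        System.out.println(padded.length + " bytes, decodes to " + read(padded));
    }
}
```

Here 154 is written as 0x80 0x80 0x81 0x1A, the same two pad bytes plus value as the last number in the Perl script, so a writer could reserve four bytes up front and fill in the real length later.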

Re: VInt's as prefix. Was: bytecount as prefix

Yonik Seeley
On 5/11/06, Marvin Humphrey <[hidden email]> wrote:
> I believe that this is possible if we change the definition of VInt
> so that the high bytes are written first, rather than the low bytes.
> The "BER compressed integer"

Great idea Marvin!  The decoding could be slightly faster with
reverse-byte order since you don't have to maintain a shift-count:

  public int readVInt() throws IOException {
    byte b = readByte();
    int i = b & 0x7F;
    while ((b & 0x80)!=0) {
      b = readByte();
      i = (i<<7) | (b & 0x7F);
    }
    return i;
  }
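
For comparison, here is a self-contained Java sketch that puts the two decoders side by side on plain byte arrays. The class and method names are made up for this example; the low-byte-first version is written from the VInt description in this thread (low bytes first, 0x80 as the continuation bit) and is not copied from the Lucene source.

```java
class VIntDecodeComparison {
    /** Low-order group first: each later group needs a growing shift count. */
    static int readLowFirst(byte[] buf) {
        int pos = 0;
        byte b = buf[pos++];
        int i = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = buf[pos++];
            i |= (b & 0x7F) << shift;
        }
        return i;
    }

    /** High-order group first: shift the accumulator left by 7 and OR in the next group. */
    static int readHighFirst(byte[] buf) {
        int pos = 0;
        byte b = buf[pos++];
        int i = b & 0x7F;
        while ((b & 0x80) != 0) {
            b = buf[pos++];
            i = (i << 7) | (b & 0x7F);
        }
        return i;
    }

    public static void main(String[] args) {
        // 300 = 2 * 128 + 44, so the 7-bit groups are 2 (high) and 44 (low).
        byte[] lowFirst  = { (byte) 0xAC, 0x02 };
        byte[] highFirst = { (byte) 0x82, 0x2C };
        System.out.println(readLowFirst(lowFirst) + " " + readHighFirst(highFirst));
    }
}
```

Both calls decode to 300; the only difference is the byte order on disk and the extra shift variable in the low-first loop.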

Of course there is that *little* detail of backward compatibility ;-)

-Yonik

Re: VInt's as prefix. Was: bytecount as prefix

Marvin Humphrey

On May 11, 2006, at 8:02 AM, Yonik Seeley wrote:

> Of course there is that *little* detail of backward compatibility ;-)

There is that.  :)

Between using bytecounts as String prefixes, transitioning from  
modified UTF-8 to standard UTF-8, and potentially changing the  
definition of VInt, there are a lot of backwards-incompatible changes
looming for the I/O classes.

Maybe we should consider loading differing subclasses of IndexInput/
IndexOutput based on the detected file format version?  If this were  
C, I'd use function pointers.  What's the best way to approximate  
that in Java?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


Re: VInt's as prefix. Was: bytecount as prefix

Yonik Seeley
On 5/11/06, Marvin Humphrey <[hidden email]> wrote:
> Maybe we should consider loading differing subclasses of IndexInput/
> IndexOutput based on the detected file format version?  If this were
> C, I'd use function pointers.  What's the best way to approximate
> that in Java?

Nothing but subclassing.

There are already different subclasses of IndexInput and IndexOutput.
The problem is, there are already 7 implementations of IndexInput, so
one would need to create 7 more implementations with different
readVInt() for example.

You could perhaps decouple and factor out part of the functionality
into a VIntReader and VIntWriter, for example, but readVInt() is
called *so* often, I'd be pretty afraid of the performance
implications.  1.5 HotSpot might be able to handle it... but then
there are people who need to use -client, people stuck on Java1.4,
etc.
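
To make the factoring-out option concrete, here is a minimal Java sketch of a decoder chosen by file format version. Everything in it is hypothetical: the VIntDecoder interface, the version constant and the class name do not exist in Lucene, and it uses lambdas from later Java versions for brevity. It also says nothing about the call overhead, which is the real concern above.

```java
class VersionedVIntSketch {
    /** Hypothetical strategy interface: decode one VInt starting at buf[offset]. */
    interface VIntDecoder {
        int decode(byte[] buf, int offset);
    }

    /** Existing layout: low-order group first, needs a shift count. */
    static final VIntDecoder LOW_FIRST = (buf, offset) -> {
        byte b = buf[offset++];
        int i = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = buf[offset++];
            i |= (b & 0x7F) << shift;
        }
        return i;
    };

    /** Proposed layout: high-order group first. */
    static final VIntDecoder HIGH_FIRST = (buf, offset) -> {
        byte b = buf[offset++];
        int i = b & 0x7F;
        while ((b & 0x80) != 0) {
            b = buf[offset++];
            i = (i << 7) | (b & 0x7F);
        }
        return i;
    };

    static final int FORMAT_HIGH_FIRST = 2; // hypothetical version number

    /** Picks the decoder once, when the file is opened and its version is known. */
    static VIntDecoder forVersion(int formatVersion) {
        return formatVersion >= FORMAT_HIGH_FIRST ? HIGH_FIRST : LOW_FIRST;
    }

    public static void main(String[] args) {
        System.out.println(forVersion(1).decode(new byte[] { (byte) 0xAC, 0x02 }, 0));
        System.out.println(forVersion(2).decode(new byte[] { (byte) 0x82, 0x2C }, 0));
    }
}
```

Each of the existing IndexInput implementations would hold one such decoder instead of needing a second subclass, at the cost of an interface call on every readVInt().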

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

Re: VInt's as prefix. Was: bytecount as prefix

Ben van Klinken
In reply to this post by Marvin Humphrey
> What we really need is the ability to add "leading zeroes" to a VInt.
I really like this idea! A VInt can then be written with a static length.

Then in clucene we can implement our stream optimisations without any
changes to the code logic.

What's the chance of this making it into Lucene 2.0? Let me know if
there's anything i can do to get this into Lucene 2.

cheers

ben

Re: VInt's as prefix. Was: bytecount as prefix

Doug Cutting
Ben van Klinken wrote:
> What's the chance of this making it into Lucene 2.0? Let me know if
> there's anything i can do to get this into Lucene 2.

Lucene 2.0 is all but out the door.  We're talking about Lucene 2.x or
Lucene 3 here.

Doug

Re: VInt's as prefix. Was: bytecount as prefix

Ben van Klinken
Ok, haven't been following the 2.0 thing very well :)
But we at clucene are trying to get this stream thing going, so would
like to do something which will be compatible with java lucene.

So if there's something I can do with the reference version so that what
we are doing isn't incompatible, it would be a great help for us.

ben

On 5/11/06, Doug Cutting <[hidden email]> wrote:

> Ben van Klinken wrote:
> > What's the chance of this making it into Lucene 2.0? Let me know if
> > there's anything i can do to get this into Lucene 2.
>
> Lucene 2.0 is all but out the door.  We're talking about Lucene 2.x or
> Lucene 3 here.
>
> Doug