storing term text internally as byte array and bytecount as prefix, etc.

storing term text internally as byte array and bytecount as prefix, etc.

jian chen
Hi, All,

Recently I have been following through the whole discussion on storing
text/string as standard UTF-8 and how to achieve that in Lucene.

If we are storing the term text and the field strings as UTF-8 bytes, I now
understand that it is a tricky issue because of the performance problem we
still face when converting back and forth between UTF-8 bytes and Java
String. This seems especially problematic for the segment merger routine,
which loads the segment term enums and converts the UTF-8 bytes back to
String during the merge operation.

Just a thought here, could we always represent the term text as UTF-8 bytes
internally? So Term.java will have the private member variable:

private byte[] utf8bytes;

instead of

private String text;

Plus, a Term object could be constructed either from a String or from a
UTF-8 byte array.

This way, for indexing new documents, new Term(String text) is called and
utf8bytes is obtained from the input term text. For a segment term info
merge, utf8bytes is loaded directly from the Lucene index, which already
stores the term text as UTF-8 bytes. Therefore, no conversion is needed.
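
To make the idea concrete, here is a rough sketch (hypothetical, not a
working patch against the real Term.java) of what such a class could look
like, with one constructor per path:

import java.nio.charset.StandardCharsets;

public final class Term {
    private final String field;
    private final byte[] utf8bytes;            // term text, already UTF-8

    // indexing path: encode the incoming text once
    public Term(String field, String text) {
        this.field = field;
        this.utf8bytes = text.getBytes(StandardCharsets.UTF_8);
    }

    // merge path: bytes come straight from the index, no conversion
    public Term(String field, byte[] utf8bytes) {
        this.field = field;
        this.utf8bytes = utf8bytes;
    }

    public String field() { return field; }

    public String text() {                     // decode only when asked for
        return new String(utf8bytes, StandardCharsets.UTF_8);
    }
}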

I hope I explained my thoughts. Make sense?

Cheers,

Jian Chen

Re: storing term text internally as byte array and bytecount as prefix, etc.

Marvin Humphrey
On May 1, 2006, at 6:27 PM, jian chen wrote:

> This way, for indexing new documents, the new Term(String text) is  
> called
> and utf8bytes will be obtained from the input term text. For  
> segment term
> info merge, the utf8bytes will be loaded from the Lucene index, which
> already stores the term text as utf8 bytes. Therefore, no  
> conversion is
> needed.

SegmentMerger will have to change to use bytes if a bytecount-based
string header is going to achieve acceptable performance.  Doug
pointed that out when I was about to throw in the towel because I  
couldn't get things fast enough.  Changing the implementation of Term  
would have a very broad impact; I'd look for other ways to go about  
it first.  But I'm not an expert on SegmentMerger, as KinoSearch  
doesn't use the same technique for merging.

My plan was to first submit a patch that made the change to the file  
format but didn't touch SegmentMerger, then attack SegmentMerger and  
also see if other developers could suggest optimizations.

However, I have an awful lot on my plate right now, and I basically  
get paid to do KinoSearch-related work, but not Lucene-related work.  
It's hard for me to break out the time to do the java coding,  
especially since I don't have that much experience with java and I'm  
slow.  I'm not sure how soon I'll be able to get back to those  
bytecount patches.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



Re: storing term text internally as byte array and bytecount as prefix, etc.

jian chen
Hi, Marvin,

Thanks for your quick response. I am in the camp of fearless refactoring,
even at the expense of breaking compatibility with previous releases. ;-)

Compatibility aside, I am trying to identify if changing the implementation
of Term is the right way to go for this problem.

If it is, I think it would be worthwhile, rather than putting a band-aid on
the existing API.

Cheers,

Jian

Changing the implementation of Term

> would have a very broad impact; I'd look for other ways to go about
> it first.  But I'm not an expert on SegmentMerger, as KinoSearch
> doesn't use the same technique for merging.
>
> My plan was to first submit a patch that made the change to the file
> format but didn't touch SegmentMerger, then attack SegmentMerger and
> also see if other developers could suggest optimizations.
>
> However, I have an awful lot on my plate right now, and I basically
> get paid to do KinoSearch-related work, but not Lucene-related work.
> It's hard for me to break out the time to do the java coding,
> especially since I don't have that much experience with java and I'm
> slow.  I'm not sure how soon I'll be able to get back to those
> bytecount patches.
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
>

Re: storing term text internally as byte array and bytecount as prefix, etc.

Chuck Williams-2
Could someone summarize succinctly why it is considered a major issue
that Lucene uses the Java modified UTF-8 encoding within its index
rather than the standard UTF-8 encoding?  Is the only concern
compatibility with index formats in other Lucene variants?  The API to
the values is a String, which uses Java's char representation, so I'm
confused why the encoding in the index is so important.

One possible benefit of a standard UTF-8 index encoding would be
streaming content into and out of the index with no copying or
conversions.  This relates to the lazy field loading mechanism.

Thanks for any clarification,

Chuck


jian chen wrote on 05/01/2006 04:24 PM:

> Hi, Marvin,
>
> Thanks for your quick response. I am in the camp of fearless refactoring,
> even at the expense of breaking compatibility with previous releases. ;-)
>
> Compatibility aside, I am trying to identify if changing the
> implementation
> of Term is the right way to go for this problem.
>
> If it is, I think it would be worthwhile rather than putting band-aid
> on the
> existing API.
>
> Cheers,
>
> Jian
>
> Changing the implementation of Term
>> would have a very broad impact; I'd look for other ways to go about
>> it first.  But I'm not an expert on SegmentMerger, as KinoSearch
>> doesn't use the same technique for merging.
>>
>> My plan was to first submit a patch that made the change to the file
>> format but didn't touch SegmentMerger, then attack SegmentMerger and
>> also see if other developers could suggest optimizations.
>>
>> However, I have an awful lot on my plate right now, and I basically
>> get paid to do KinoSearch-related work, but not Lucene-related work.
>> It's hard for me to break out the time to do the java coding,
>> especially since I don't have that much experience with java and I'm
>> slow.  I'm not sure how soon I'll be able to get back to those
>> bytecount patches.
>>
>> Marvin Humphrey
>> Rectangular Research
>> http://www.rectangular.com/
>>
>



Re: storing term text internally as byte array and bytecount as prefix, etc.

jian chen
Hi, Chuck,

Using standard UTF-8 is very important for the Lucene index so that any
program can read it easily, be it written in Perl, C/C++ or any future
programming language.

It is like storing data in a database for a web application. You want to
store it in such a way that other programs besides the web app can
manipulate it easily, because there will be cases where you want to mass
update or mass change the data, and you don't want to have to go through
the web app to do it, right?

Cheers,

Jian


On 5/1/06, Chuck Williams <[hidden email]> wrote:

>
> Could someone summarize succinctly why it is considered a major issue
> that Lucene uses the Java modified UTF-8 encoding within its index
> rather than the standard UTF-8 encoding.  Is the only concern
> compatibility with index formats in other Lucene variants?  The API to
> the values is a String, which uses Java's char representation, so I'm
> confused why the encoding in the index is so important.
>
> One possible benefit of a standard UTF-8 index encoding would be
> streaming content into and out of the index with no copying or
> conversions.  This relates to the lazy field loading mechanism.
>
> Thanks for any clarification,
>
> Chuck
>
>
> jian chen wrote on 05/01/2006 04:24 PM:
> > Hi, Marvin,
> >
> > Thanks for your quick response. I am in the camp of fearless
> refactoring,
> > even at the expense of breaking compatibility with previous releases.
> ;-)
> >
> > Compatibility aside, I am trying to identify if changing the
> > implementation
> > of Term is the right way to go for this problem.
> >
> > If it is, I think it would be worthwhile rather than putting band-aid
> > on the
> > existing API.
> >
> > Cheers,
> >
> > Jian
> >
> > Changing the implementation of Term
> >> would have a very broad impact; I'd look for other ways to go about
> >> it first.  But I'm not an expert on SegmentMerger, as KinoSearch
> >> doesn't use the same technique for merging.
> >>
> >> My plan was to first submit a patch that made the change to the file
> >> format but didn't touch SegmentMerger, then attack SegmentMerger and
> >> also see if other developers could suggest optimizations.
> >>
> >> However, I have an awful lot on my plate right now, and I basically
> >> get paid to do KinoSearch-related work, but not Lucene-related work.
> >> It's hard for me to break out the time to do the java coding,
> >> especially since I don't have that much experience with java and I'm
> >> slow.  I'm not sure how soon I'll be able to get back to those
> >> bytecount patches.
> >>
> >> Marvin Humphrey
> >> Rectangular Research
> >> http://www.rectangular.com/
> >>
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>

Re: storing term text internally as byte array and bytecount as prefix, etc.

jian chen
Plus, as open source and open standard advocates, we don't want to be like
Micros$ft, who claims to use industry "standard" XML as the next-generation
Word file format. However, it is very hard to write your own Word reader,
because their file format is proprietary and hard to write programs for.

Jian

On 5/1/06, jian chen <[hidden email]> wrote:

>
> Hi, Chuck,
>
> Using standard UTF-8 is very important for Lucene index so any program
> could read the Lucene index easily, be it written in perl, c/c++ or any new
> future programming languages.
>
> It is like storing data in a database for web application. You want to
> store it in such a way that other programs can manipulate easily other than
> only the web app program. Because there will be cases that you want to mass
> update or mass change the data, and you don't want to write only web apps
> for doing it, right?
>
> Cheers,
>
> Jian
>
>
>
> On 5/1/06, Chuck Williams <[hidden email]> wrote:
> >
> > Could someone summarize succinctly why it is considered a major issue
> > that Lucene uses the Java modified UTF-8 encoding within its index
> > rather than the standard UTF-8 encoding.  Is the only concern
> > compatibility with index formats in other Lucene variants?  The API to
> > the values is a String, which uses Java's char representation, so I'm
> > confused why the encoding in the index is so important.
> >
> > One possible benefit of a standard UTF-8 index encoding would be
> > streaming content into and out of the index with no copying or
> > conversions.  This relates to the lazy field loading mechanism.
> >
> > Thanks for any clarification,
> >
> > Chuck
> >
> >
> > jian chen wrote on 05/01/2006 04:24 PM:
> > > Hi, Marvin,
> > >
> > > Thanks for your quick response. I am in the camp of fearless
> > refactoring,
> > > even at the expense of breaking compatibility with previous releases.
> > ;-)
> > >
> > > Compatibility aside, I am trying to identify if changing the
> > > implementation
> > > of Term is the right way to go for this problem.
> > >
> > > If it is, I think it would be worthwhile rather than putting band-aid
> > > on the
> > > existing API.
> > >
> > > Cheers,
> > >
> > > Jian
> > >
> > > Changing the implementation of Term
> > >> would have a very broad impact; I'd look for other ways to go about
> > >> it first.  But I'm not an expert on SegmentMerger, as KinoSearch
> > >> doesn't use the same technique for merging.
> > >>
> > >> My plan was to first submit a patch that made the change to the file
> > >> format but didn't touch SegmentMerger, then attack SegmentMerger and
> > >> also see if other developers could suggest optimizations.
> > >>
> > >> However, I have an awful lot on my plate right now, and I basically
> > >> get paid to do KinoSearch-related work, but not Lucene-related work.
> > >> It's hard for me to break out the time to do the java coding,
> > >> especially since I don't have that much experience with java and I'm
> > >> slow.  I'm not sure how soon I'll be able to get back to those
> > >> bytecount patches.
> > >>
> > >> Marvin Humphrey
> > >> Rectangular Research
> > >> http://www.rectangular.com/
> > >>
> > >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [hidden email]
> > For additional commands, e-mail: [hidden email]
> >
> >
>

Re: storing term text internally as byte array and bytecount as prefix, etc.

Chuck Williams-2
Hi Jian,

I agree with you about Microsoft.  It's a standard ploy to put window
dressing on stuff to combat competition, in this case from the open
document standard.

So the UTF-8 concern is interoperability with other programs at the
index level.  An interesting question here is whether the Lucene index
format should be considered an API for this purpose.  Most software
systems choose not to publish their internal representations due to
upward compatibility and other concerns, choosing to provide an API
instead.  This is true for databases as well, for example.

Lucene does publish the index formats, so perhaps they are supposed to
be a full API.  If so, it does not seem difficult to write, in other
languages, a few lines of code analogous to what Lucene uses in
IndexInput.readChars() and IndexOutput.writeChars().  The main Lucene
code tree is in Java, which uses the modified UTF-8 encoding.
Compatibility with that seems most important to me.
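
For what it's worth, the decode side really is only a handful of lines.
Here is a rough sketch of a decoder for that 1/2/3-bytes-per-char scheme
(illustrative only, from memory, not the actual Lucene source):

import java.io.DataInput;
import java.io.IOException;

class ReadCharsSketch {
    // Decode len chars written in the 1/2/3-byte-per-char encoding.
    static void readChars(DataInput in, char[] buf, int off, int len)
            throws IOException {
        for (int i = off; i < off + len; i++) {
            int b = in.readByte() & 0xFF;
            if ((b & 0x80) == 0)                   // 0xxxxxxx: one byte
                buf[i] = (char) b;
            else if ((b & 0xE0) != 0xE0)           // 110xxxxx: two bytes
                buf[i] = (char) (((b & 0x1F) << 6)
                               | (in.readByte() & 0x3F));
            else                                   // 1110xxxx: three bytes
                buf[i] = (char) (((b & 0x0F) << 12)
                               | ((in.readByte() & 0x3F) << 6)
                               |  (in.readByte() & 0x3F));
        }
    }
}

Porting a loop like that to another language does not look like the hard
part.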

One benefit of using Java's modified encoding is being able to
determine the number of encoded characters from the length of a String
prior to encoding it.  Without this, writeString() would need to write
the string length after writing the encoded characters, which would
involve two extra seeks and could slow things down considerably, unless
there is a different approach that avoids this.
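
One way that might avoid the extra seeks would be to encode into a scratch
buffer first, so the byte count is known before anything is written, at the
cost of an extra copy.  A rough sketch (hypothetical, assuming IndexOutput's
writeVInt and writeBytes):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.lucene.store.IndexOutput;

class ByteCountWriteSketch {
    static void writeString(IndexOutput out, String s) throws IOException {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);  // standard UTF-8
        out.writeVInt(utf8.length);        // byte count, known up front
        out.writeBytes(utf8, utf8.length); // then the encoded bytes
    }
}

Whether the extra buffering is acceptable is the same performance question
as above.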

For lazy fields, there would be a substantial benefit to having the
count on a String be an encoded byte count rather than a Java char
count, but this has the same problem.  If there is a way to beat this
problem, then I'd start arguing for a byte count.

Chuck


jian chen wrote on 05/01/2006 06:23 PM:

> Plus, as open source and open standard advocates, we don't want to be
> like
> Micros$ft, who claims to use industrial "standard" XML as the next
> generation word file format. However, it is very hard to write your
> own Word
> reader, because their word file format is proprietary and hard to write
> programs for.
>
> Jian
>
> On 5/1/06, jian chen <[hidden email]> wrote:
>>
>> Hi, Chuck,
>>
>> Using standard UTF-8 is very important for Lucene index so any program
>> could read the Lucene index easily, be it written in perl, c/c++ or
>> any new
>> future programming languages.
>>
>> It is like storing data in a database for web application. You want to
>> store it in such a way that other programs can manipulate easily
>> other than
>> only the web app program. Because there will be cases that you want
>> to mass
>> update or mass change the data, and you don't want to write only web
>> apps
>> for doing it, right?
>>
>> Cheers,
>>
>> Jian
>>
>>
>>
>> On 5/1/06, Chuck Williams <[hidden email]> wrote:
>> >
>> > Could someone summarize succinctly why it is considered a major issue
>> > that Lucene uses the Java modified UTF-8 encoding within its index
>> > rather than the standard UTF-8 encoding.  Is the only concern
>> > compatibility with index formats in other Lucene variants?  The API to
>> > the values is a String, which uses Java's char representation, so I'm
>> > confused why the encoding in the index is so important.
>> >
>> > One possible benefit of a standard UTF-8 index encoding would be
>> > streaming content into and out of the index with no copying or
>> > conversions.  This relates to the lazy field loading mechanism.
>> >
>> > Thanks for any clarification,
>> >
>> > Chuck
>> >
>> >
>> > jian chen wrote on 05/01/2006 04:24 PM:
>> > > Hi, Marvin,
>> > >
>> > > Thanks for your quick response. I am in the camp of fearless
>> > refactoring,
>> > > even at the expense of breaking compatibility with previous
>> releases.
>> > ;-)
>> > >
>> > > Compatibility aside, I am trying to identify if changing the
>> > > implementation
>> > > of Term is the right way to go for this problem.
>> > >
>> > > If it is, I think it would be worthwhile rather than putting
>> band-aid
>> > > on the
>> > > existing API.
>> > >
>> > > Cheers,
>> > >
>> > > Jian
>> > >
>> > > Changing the implementation of Term
>> > >> would have a very broad impact; I'd look for other ways to go about
>> > >> it first.  But I'm not an expert on SegmentMerger, as KinoSearch
>> > >> doesn't use the same technique for merging.
>> > >>
>> > >> My plan was to first submit a patch that made the change to the
>> file
>> > >> format but didn't touch SegmentMerger, then attack SegmentMerger
>> and
>> > >> also see if other developers could suggest optimizations.
>> > >>
>> > >> However, I have an awful lot on my plate right now, and I basically
>> > >> get paid to do KinoSearch-related work, but not Lucene-related
>> work.
>> > >> It's hard for me to break out the time to do the java coding,
>> > >> especially since I don't have that much experience with java and
>> I'm
>> > >> slow.  I'm not sure how soon I'll be able to get back to those
>> > >> bytecount patches.
>> > >>
>> > >> Marvin Humphrey
>> > >> Rectangular Research
>> > >> http://www.rectangular.com/
>> > >>
>> > >
>> >
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: [hidden email]
>> > For additional commands, e-mail: [hidden email]
>> >
>> >
>>
>



Re: storing term text internally as byte array and bytecount as prefix, etc.

Doug Cutting
Chuck Williams wrote:
> For lazy fields, there would be a substantial benefit to having the
> count on a String be an encoded byte count rather than a Java char
> count, but this has the same problem.  If there is a way to beat this
> problem, then I'd start arguing for a byte count.

I think the way to beat it is to keep things as bytes as long as
possible.  For example, each term in a Query needs to be converted from
String to byte[], but after that all search computation could happen
comparing byte arrays.  (Note that lexicographic comparisons of UTF-8
encoded bytes give the same results as lexicographic comparisons of
Unicode character strings.)  And, when indexing, each Token would need
to be converted from String to byte[] just once.
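
A minimal sketch of such a comparison (just to illustrate the point; the
bytes must be treated as unsigned):

class Utf8Compare {
    // Compare two UTF-8 encoded terms byte by byte; for valid UTF-8 this
    // ordering matches comparing the strings code point by code point.
    static int compare(byte[] a, byte[] b) {
        int len = Math.min(a.length, b.length);
        for (int i = 0; i < len; i++) {
            int x = a[i] & 0xFF;   // unsigned byte value
            int y = b[i] & 0xFF;
            if (x != y)
                return x - y;
        }
        return a.length - b.length;  // shorter string sorts first
    }
}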

The Java API can easily be made back-compatible.  The harder part would
be making the file format back-compatible.

Doug


Re: storing term text internally as byte array and bytecount as prefix, etc.

jian chen
Hi, Doug,

I totally agree with what you said. Yeah, I think it is more of a file
format issue, less of an API issue. It seems that we just need to add an
extra constructor to Term.java to take in a utf8 byte array.

Lucene 2.0 is going to break backward compatibility anyway, right? So,
maybe this change to standard UTF-8 could be a hot item on the Lucene 2.0
list?

Cheers,

Jian Chen

On 5/2/06, Doug Cutting <[hidden email]> wrote:

>
> Chuck Williams wrote:
> > For lazy fields, there would be a substantial benefit to having the
> > count on a String be an encoded byte count rather than a Java char
> > count, but this has the same problem.  If there is a way to beat this
> > problem, then I'd start arguing for a byte count.
>
> I think the way to beat it is to keep things as bytes as long as
> possible.  For example, each term in a Query needs to be converted from
> String to byte[], but after that all search computation could happen
> comparing byte arrays.  (Note that lexicographic comparisons of UTF-8
> encoded bytes give the same results as lexicographic comparisions of
> Unicode character strings.)  And, when indexing, each Token would need
> to be converted from String to byte[] just once.
>
> The Java API can easily be made back-compatible.  The harder part would
> be making the file format back-compatible.
>
> Doug
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>

Re: storing term text internally as byte array and bytecount as prefix, etc.

Chuck Williams-2
The benefits of a byte count are substantial, including:

   1. Lazy fields can skip strings without reading them, as they do for
      all other value types.
   2. The file format could be changed to standard UTF-8 without any
      significant performance cost.
   3. Any other index operation that relies on the index format will
      have an easier time with a representation that is a) easy to
      quickly scan and b) consistent (all value types start with a byte
      count).

Re. 3, Jian is concerned about programs in other languages that
manipulate Lucene index files.  I have such a program in Java and face
the same issue.  My case is a robust and general implementation of
IndexUpdater that copies segments, transforming field values and updating
both stored values and postings (not yet term vectors).  It is optimized
to skip (copy) and/or minimally process unchanged areas, which are
typically most areas.  This process is slowed when processing unchanged
stored String values due to the current char count representation -- it
faces precisely the same issue as the lazy fields mechanism.
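
To make the contrast concrete, a rough sketch of the two cases (hypothetical
helper methods, assuming IndexInput's readVInt/seek/getFilePointer/readChars):

import java.io.IOException;
import org.apache.lucene.store.IndexInput;

class StringSkipSketch {
    // With a byte-count prefix, a stored string can be skipped untouched.
    static void skipByByteCount(IndexInput in) throws IOException {
        int numBytes = in.readVInt();             // prefix is a byte count
        in.seek(in.getFilePointer() + numBytes);  // jump straight over it
    }

    // With the current char-count prefix, the bytes must be decoded just
    // to find out where the string ends.
    static void skipByCharCount(IndexInput in) throws IOException {
        int numChars = in.readVInt();             // prefix is a Java char count
        char[] buf = new char[numChars];
        in.readChars(buf, 0, numChars);           // decode only to skip past it
    }
}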

Re. the file format compatibility issue, if backward compatibility is a
requirement here, then it would seem to be necessary to have a
configuration option to choose the encoding of stored strings.  It seems
easy to generalize the Lucene APIs to specify an interface for any
desired encode/decode.

Chuck


jian chen wrote on 05/02/2006 08:15 AM:

> Hi, Doug,
>
> I totally agree with what you said. Yeah, I think it is more of a file
> format issue, less of an API issue. It seems that we just need to add an
> extra constructor to Term.java to take in utf8 byte array.
>
> Lucene 2.0 is going to break the backward compability anyway, right? So,
> maybe this change to standard UTF-8 could be a hot item on the Lucene
> 2.0list?
>
> Cheers,
>
> Jian Chen
>
> On 5/2/06, Doug Cutting <[hidden email]> wrote:
>>
>> Chuck Williams wrote:
>> > For lazy fields, there would be a substantial benefit to having the
>> > count on a String be an encoded byte count rather than a Java char
>> > count, but this has the same problem.  If there is a way to beat this
>> > problem, then I'd start arguing for a byte count.
>>
>> I think the way to beat it is to keep things as bytes as long as
>> possible.  For example, each term in a Query needs to be converted from
>> String to byte[], but after that all search computation could happen
>> comparing byte arrays.  (Note that lexicographic comparisons of UTF-8
>> encoded bytes give the same results as lexicographic comparisions of
>> Unicode character strings.)  And, when indexing, each Token would need
>> to be converted from String to byte[] just once.
>>
>> The Java API can easily be made back-compatible.  The harder part would
>> be making the file format back-compatible.
>>
>> Doug
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>
>>
>



Re: storing term text internally as byte array and bytecount as prefix, etc.

Tatu Saloranta
In reply to this post by jian chen
--- jian chen <[hidden email]> wrote:

> Plus, as open source and open standard advocates, we
> don't want to be like
> Micros$ft, who claims to use industrial "standard"
> XML as the next
> generation word file format. However, it is very
> hard to write your own Word
> reader, because their word file format is
> proprietary and hard to write
> programs for.

Note, though, that "Java modified UTF-8" IS the
standard on the Java platform (and there are valid
reasons for it using a slightly different encoding
from the canonical one); so changing it in any way
would make it less standard, not more (within the
context of the Java platform).

Second, unless I'm mistaken, there is nothing special
in the Java encoding that would make it a problem for,
say, a C/C++ implementation. I thought Perl had some
specific problems, since its UTF-8 support is more
hard-coded; whereas it is possible (and not very
difficult) to customize char<->UTF-8 serialization
elsewhere, it's not quite as easy in Perl (at least
not with any reasonable efficiency).

I would actually be more interested in other
performance aspects of avoiding String instantiation:
managing byte arrays directly, and/or using canonical
caching from byte[] directly to Strings, can bring
significant performance improvements when
serializing/deserializing tokens to/from disk, at the
likely expense of a bit more memory usage (the Term
object should probably hold a lazily instantiated
String/byte[], depending on how it was created).
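
A rough sketch of that lazy idea (hypothetical, not
Lucene code): the term keeps whichever form it was
created from and converts to the other only on demand,
caching the result:

import java.nio.charset.StandardCharsets;

public final class LazyTerm {
    private String text;   // null until first requested
    private byte[] utf8;   // null until first requested

    public LazyTerm(String text) { this.text = text; }
    public LazyTerm(byte[] utf8) { this.utf8 = utf8; }

    public String text() {
        if (text == null)                       // created from bytes
            text = new String(utf8, StandardCharsets.UTF_8);
        return text;
    }

    public byte[] utf8() {
        if (utf8 == null)                       // created from a String
            utf8 = text.getBytes(StandardCharsets.UTF_8);
        return utf8;
    }
}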

-+ Tatu +-

>
> Jian
>
> On 5/1/06, jian chen <[hidden email]> wrote:
> >
> > Hi, Chuck,
> >
> > Using standard UTF-8 is very important for Lucene
> index so any program
> > could read the Lucene index easily, be it written
> in perl, c/c++ or any new
> > future programming languages.
> >
> > It is like storing data in a database for web
> application. You want to
> > store it in such a way that other programs can
> manipulate easily other than
> > only the web app program. Because there will be
> cases that you want to mass
> > update or mass change the data, and you don't want
> to write only web apps
> > for doing it, right?
> >
> > Cheers,
> >
> > Jian
> >
> >
> >
> > On 5/1/06, Chuck Williams <[hidden email]>
> wrote:
> > >
> > > Could someone summarize succinctly why it is
> considered a major issue
> > > that Lucene uses the Java modified UTF-8
> encoding within its index
> > > rather than the standard UTF-8 encoding.  Is the
> only concern
> > > compatibility with index formats in other Lucene
> variants?  The API to
> > > the values is a String, which uses Java's char
> representation, so I'm
> > > confused why the encoding in the index is so
> important.
> > >
> > > One possible benefit of a standard UTF-8 index
> encoding would be
> > > streaming content into and out of the index with
> no copying or
> > > conversions.  This relates to the lazy field
> loading mechanism.
> > >
> > > Thanks for any clarification,
> > >
> > > Chuck
> > >
> > >
> > > jian chen wrote on 05/01/2006 04:24 PM:
> > > > Hi, Marvin,
> > > >
> > > > Thanks for your quick response. I am in the
> camp of fearless
> > > refactoring,
> > > > even at the expense of breaking compatibility
> with previous releases.
> > > ;-)
> > > >
> > > > Compatibility aside, I am trying to identify
> if changing the
> > > > implementation
> > > > of Term is the right way to go for this
> problem.
> > > >
> > > > If it is, I think it would be worthwhile
> rather than putting band-aid
> > > > on the
> > > > existing API.
> > > >
> > > > Cheers,
> > > >
> > > > Jian
> > > >
> > > > Changing the implementation of Term
> > > >> would have a very broad impact; I'd look for
> other ways to go about
> > > >> it first.  But I'm not an expert on
> SegmentMerger, as KinoSearch
> > > >> doesn't use the same technique for merging.
> > > >>
> > > >> My plan was to first submit a patch that made
> the change to the file
> > > >> format but didn't touch SegmentMerger, then
> attack SegmentMerger and
> > > >> also see if other developers could suggest
> optimizations.
> > > >>
> > > >> However, I have an awful lot on my plate
> right now, and I basically
> > > >> get paid to do KinoSearch-related work, but
> not Lucene-related work.
> > > >> It's hard for me to break out the time to do
> the java coding,
> > > >> especially since I don't have that much
> experience with java and I'm
> > > >> slow.  I'm not sure how soon I'll be able to
> get back to those
> > > >> bytecount patches.
> > > >>
> > > >> Marvin Humphrey
> > > >> Rectangular Research
> > > >> http://www.rectangular.com/
> > > >>
> > > >
> > >
> > >
> > >
>
---------------------------------------------------------------------
> > > To unsubscribe, e-mail:
> [hidden email]
> > > For additional commands, e-mail:
> [hidden email]
> > >
> > >
> >
>



Re: storing term text internally as byte array and bytecount as prefix, etc.

Marvin Humphrey
In reply to this post by Chuck Williams-2

On May 1, 2006, at 7:33 PM, Chuck Williams wrote:
 > Could someone summarize succinctly why it is considered a
 > major issue that Lucene uses the Java modified UTF-8
 > encoding within its index rather than the standard UTF-8
 > encoding.  Is the only concern compatibility with index
 > formats in other Lucene variants?

I originally raised a stink about "Modified UTF-8" because at the  
time I was embroiled in an effort to implement the Lucene file  
format, and the
Lucene File Formats document claimed to be using "UTF-8", straight  
up.  It was most unpleasant to discover that if my app read legal  
UTF-8, Lucene-generated indexes would cause it to crash from time to  
time, and that if it wrote legal UTF-8, the indexes it generated  
would cause Lucene to crash from time to time.
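
For anyone who hasn't run into it, here is a quick way to see the
divergence.  DataOutputStream.writeUTF uses Java's modified encoding (the
framing differs from what Lucene writes, but the character-level issue is
the same): U+0000 and supplementary characters come out as different byte
sequences than legal UTF-8 produces.

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.nio.charset.StandardCharsets;

public class ModifiedUtf8Demo {
    public static void main(String[] args) throws Exception {
        // U+0000 plus a supplementary character (U+1D11E, musical G clef)
        String s = "\u0000" + new String(Character.toChars(0x1D11E));

        byte[] standard = s.getBytes(StandardCharsets.UTF_8);

        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new DataOutputStream(bos).writeUTF(s);
        byte[] modified = bos.toByteArray();    // 2-byte length header + data

        // standard UTF-8: 0x00 plus one 4-byte sequence            -> 5 bytes
        // modified UTF-8: 0xC0 0x80 plus two 3-byte surrogate
        //                 sequences                                 -> 8 bytes
        System.out.println("standard: " + standard.length + " bytes");
        System.out.println("modified: " + (modified.length - 2) + " bytes");
    }
}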

More problematic than the "Modified UTF-8", actually, is the
definition of a Lucene String.   According to the File Formats  
document, "Lucene writes strings as a VInt representing the length,  
followed by the character data."  The word "length" is ambiguous in  
that context, and at first I took it to mean either length in Unicode  
code points or bytes.  It was a nasty shock to discover that it was  
actually Java chars.  Bizarre and painful contortions were suddenly  
required for encoding/decoding a term dictionary which would  
otherwise have been completely unnecessary.

I used to think that the Lucene file format might serve as "the TIFF  
of inverted indexes".  My perspective on this has changed.  Lucene's  
file format is just beastly difficult to implement from scratch, and  
anything short of full implementation guarantees occasional "Read  
past EOF" errors on interchange.  Personally, I would assess the file  
format as the secondary expression of a beautiful algorithmic  
design.  Ease of interchange and ease of implementation do not seem  
to have been primary design considerations -- which is perfectly  
reasonable, if true, but perhaps then it should not aspire to serve  
as a vehicle for interchange.  As was asserted in the recent thread  
on ACID compliance, the indexes produced by a full-text indexer are  
not meant to serve as primary document storage.  It's common to need  
to move a TIFF or a text file from system to system.  It's not common  
to need to move a derived index.

Compatibility has its advantages.  It was pretty nice to be able to  
browse KinoSearch-generated indexes using Luke, once I managed to  
achieve compatibility for all-ascii source material.  But holy crow,  
was it tough to debug those indexes.  No human readable components.  
No fixed block sizes.  No facilities for resyncing a stream once it  
gets off.  All that on top of the "Modified UTF-8" and the String  
definition.

At this point I think the suggestion of turning the File Formats  
document from an ostensible spec into a piece of ordinary  
documentation is a worthy one.  FWIW, I've pretty much given up on  
the idea of making KinoSearch and Lucene file-format-compatible.  In  
my weaker moments I imagine that I might sell the Lucene community on  
the changes that would be necessary.  Then I remember that many of  
you live in a world where "Modified UTF-8" isn't an abomination.  ;)

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



Re: storing term text internally as byte array and bytecount as prefix, etc.

Doug Cutting
Marvin Humphrey wrote:
> More problematic than the "Modified UTF-8" actually, is the definition
> of a Lucene String.   According to the File Formats document, "Lucene
> writes strings as a VInt representing the length, followed by the
> character data."  The word "length" is ambiguous in that context, and at
> first I took it to mean either length in Unicode code points or bytes.  
> It was a nasty shock to discover that it was actually Java chars.  
> Bizarre and painful contortions were suddenly required for
> encoding/decoding a term dictionary which would otherwise have been
> completely unnecessary.

Yes, this should be corrected.  The problem is that "length" refers to
the length of the Java string, but that is not explicit.  Moreover, as
you have pointed out, that is a bad choice for non-Java implementations.

> Ease of
> interchange and ease of implementation do not seem to have been primary
> design considerations -- which is perfectly reasonable, if true, but
> perhaps then it should not aspire to serve as a vehicle for
> interchange.

The index format document was written years after Lucene was written,
after Lucene had already been ported to other languages.  It seemed like
a good idea to document what folks were porting.  Ease of interchange
and implementation were not primary considerations when Lucene was
developed.  That said, at the time Lucene was first written (1997),
Unicode was only 16-bit and there was no discrepancy between Java's
modified encoding and UTF-8.

> At this point I think the suggestion of turning the File Formats
> document from an ostensible spec into a piece of ordinary documentation
> is a worthy one.  FWIW, I've pretty much given up on the idea of making
> KinoSearch and Lucene file-format-compatible.  In my weaker moments I
> imagine that I might sell the Lucene community on the changes that would
> be necessary.

Please do.  But suggestions without working patches are not always acted
on.  Most of us are busy with other projects, and only advance Lucene
when we have a need, or someone provides a patch.  Ideally we need to
find someone who *needs* an index format that's easily interchangeable
between Java and other languages to push this forward.

> Then I remember that many of you live in a world where
> "Modified UTF-8" isn't an abomination.  ;)

Modified UTF-8 is not anyone's choice.  It's simply what's used by Java.
  What are we supposed to do, picket Sun?  If we move to make Lucene's
file format an interchange format, then we must clearly move beyond it.

Doug
