Taking a step back


Taking a step back

Grant Ingersoll
Is it just me or do we have a whole bunch of people proposing a whole
bunch of fairly broad changes to Lucene? (I know, I know, they should
always be backward compatible)  Might this warrant some
coordination/planning?  I know things are mostly done in an ad-hoc way
(whoever submits a patch), but I think we may all be better served by
some coordination beyond what takes place on the mailing list.  I see
some pieces here and there that would benefit from common code, etc.

As I see it, we have several people proposing file format changes, and Otis
and some others want scoring changes.  I have discussed with a few people
the ability to make how fields are indexed more pluggable, plus the
ability to add metadata at all levels of the index (field, document,
index, etc.); more to come on this soon.  Additionally, the lazy loading
field stuff is pending and would benefit from a few file format changes
as well.

Additionally, I don't think we can just think of the Java version anymore,
especially when it comes to file formats.  Should we, perhaps, set up a
top-level wiki-style planning place?  Would this be useful?  I don't think
it replaces the good discussions on this list; I just think it could give
potential contributors a much easier way of finding how to help, plus set
an (albeit loose) plan for the future of Lucene beyond what is captured in
snippets of email here and there.

Just my two cents,
Grant

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


RE: Taking a step back

Robert Engels
I agree with almost all of what you said.

The file format issue, however, is a non-issue. If you want interoperability
between systems, do it via remote invocation and IIOP, or some HTTP
interface. This is far easier to control, especially through version
change cycles - otherwise all platforms need to be updated together, which
is very hard to do (unless you are using Java with WORA!).

I also don't understand why Lucene doesn't focus on being THE JAVA search
engine. I think anything that detracts from moving that forward should be
out of scope.



Re: Taking a step back

Otis Gospodnetic-2
In reply to this post by Grant Ingersoll
You mean you want us to be more organized!?!? :)
I think a Wiki page like http://wiki.apache.org/jakarta-lucene/Lucene2Whiteboard might help.  Something like http://wiki.apache.org/jakarta-lucene/Lucene2.1Whiteboard

Otis


Re: Taking a step back

Erik Hatcher
Speaking of the wiki, we need to move it to /lucene instead of the
old /jakarta-lucene.  I've copied infrastructure on this message to
find out what we need to do to shift it.

        Erik




On May 10, 2006, at 11:30 AM, Otis Gospodnetic wrote:

> You mean you want us to be more organized!?!? :)
> I think a Wiki page like
> http://wiki.apache.org/jakarta-lucene/Lucene2Whiteboard might help.
> Something like http://wiki.apache.org/jakarta-lucene/Lucene2.1Whiteboard
>
> Otis


Re: Taking a step back

Grant Ingersoll
In reply to this post by Otis Gospodnetic-2
Sure, or even a place called Lucene Planning or Lucene Strategy.  Just
not sure if it should be only on the Java side or not.  Or even
Lucene3Whiteboard (did I really write Lucene 3?!?)

Otis Gospodnetic wrote:

> You mean you want us to be more organized!?!? :)
> I think a Wiki page like http://wiki.apache.org/jakarta-lucene/Lucene2Whiteboard might help.  Something like http://wiki.apache.org/jakarta-lucene/Lucene2.1Whiteboard
>
> Otis

--

Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
School of Information Studies
335 Hinds Hall
Syracuse, NY 13244

http://www.cnlp.org 
Voice:  315-443-5484
Fax: 315-443-6886



Re: Taking a step back

Karl Wettin-3
On Wed, 2006-05-10 at 13:29 -0400, Grant Ingersoll wrote:
> Or even Lucene3Whiteboard (did I really write Lucene 3?!?)

You know, I was just thinking that it would be nice if Lucene were
developed like the Linux kernel: while 2.6 is stable, people are beta
testing 2.7 and some are hacking on 2.8.




Re: Taking a step back

Doug Cutting
Lucene version numbers are about compatibility.

Minor versions should always have complete API back-compatibility.
That's to say, any code developed against X.0 should continue to run
without alteration against all X.N releases.  A major release may
introduce incompatible API changes.  The transition strategy is to
introduce new APIs in release X.N, deprecating old APIs, then remove all
deprecated APIs in release (X+1).0.
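
The transition strategy above can be sketched in Java roughly as follows. This is only an illustration under stated assumptions: the class TermLookup and the methods lookup and find are hypothetical names invented for the example, not Lucene APIs, and the @Deprecated annotation assumes Java 5 or later (on older JVMs only the @deprecated Javadoc tag would be available).

```java
// Hypothetical API, for illustration only; these are not Lucene classes.
final class TermLookup {

    /**
     * Old entry point, present since X.0.
     *
     * @deprecated as of X.N; use {@link #find(String, String)} instead.
     *             Slated for removal in (X+1).0.
     */
    @Deprecated
    static String lookup(String term) {
        // Delegate to the new API so code written against X.0 keeps working.
        return find("contents", term);
    }

    /** New entry point, introduced in X.N alongside the old one. */
    static String find(String field, String term) {
        return field + ":" + term;
    }

    public static void main(String[] args) {
        System.out.println(lookup("lucene"));        // old callers still run
        System.out.println(find("title", "lucene")); // new callers use the new API
    }
}
```

Within the X.N series both methods coexist; only in (X+1).0 would lookup be deleted, which is what makes that release a major one.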

File formats are back-compatible between major versions.  Version X.N
should be able to read indexes generated by any version from (X-1).0
onward, but may or may not be able to read indexes generated by
version (X-2).N.

Note that older releases are never guaranteed to be able to read indexes
generated by newer releases.  When this is attempted, a predictable
error should be generated.
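
As a rough sketch, the index-reading rule could be written as a small decision function. The class IndexCompat, the enum Support, and the method check are hypothetical names for this example; nothing here is an existing Lucene API, and the sketch assumes versions are simple major.minor pairs.

```java
// A minimal sketch of the index compatibility rule; all names are hypothetical.
final class IndexCompat {

    enum Support {
        GUARANTEED,        // reader must be able to read this index
        NOT_GUARANTEED,    // index is two or more major versions old
        NEWER_THAN_READER  // index written by a newer release: expect a predictable error
    }

    static Support check(int readerMajor, int readerMinor, int indexMajor, int indexMinor) {
        boolean indexIsNewer = indexMajor > readerMajor
                || (indexMajor == readerMajor && indexMinor > readerMinor);
        if (indexIsNewer) {
            return Support.NEWER_THAN_READER;
        }
        // Same major version, or exactly one major version back.
        if (indexMajor >= readerMajor - 1) {
            return Support.GUARANTEED;
        }
        return Support.NOT_GUARANTEED;
    }

    public static void main(String[] args) {
        System.out.println(check(2, 1, 1, 9)); // previous major version
        System.out.println(check(3, 0, 1, 4)); // two major versions back
        System.out.println(check(1, 9, 2, 0)); // index newer than reader
    }
}
```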

Does that sound reasonable?

Doug


RE: Taking a step back

Robert Engels
What about the case where a "bug" is found that necessitates a file format
change?

Obviously this should be VERY rare given adequate testing, but it seems
difficult to make a hard and fast rule that X.0 should be able to ALWAYS
read X.N.



Re: Taking a step back

Karl Wettin-3
In reply to this post by Doug Cutting
On Wed, 2006-05-10 at 11:13 -0700, Doug Cutting wrote:

> File formats are back-compatible between major versions.  Version X.N
> should be able to read indexes generated by any version after and
> including version X-1.0, but may-or-may-not be able to read indexes
> generated by version X-2.N.
>
> Note that older releases are never guaranteed to be able to read
> indexes generated by newer releases.  When this is attempted, a
> predictable error should be generated.
>
> Does that sound reasonable?

It sounds great.




Re: Taking a step back

Bill Janssen
In reply to this post by Doug Cutting
> File formats are back-compatible between major versions.  Version X.N
> should be able to read indexes generated by any version after and
> including version X-1.0, but may-or-may-not be able to read indexes
> generated by version X-2.N.
>
> Note that older releases are never guaranteed to be able to read indexes
> generated by newer releases.  When this is attempted, a predictable
> error should be generated.
>
> Does that sound reasonable?

Have you put a field in the file format yet that gives its version?
Alternatively, is there a way to find out which version of Lucene
needs to be used with a given index?

Bill


Re: Taking a step back

Karl Wettin-3
In reply to this post by Doug Cutting
On Wed, 2006-05-10 at 11:13 -0700, Doug Cutting wrote:

>
> File formats are back-compatible between major versions.  Version X.N
> should be able to read indexes generated by any version after and
> including version X-1.0, but may-or-may-not be able to read indexes
> generated by version X-2.N.
>
> Note that older releases are never guaranteed to be able to read
> indexes generated by newer releases.  When this is attempted, a
> predictable error should be generated.
>
> Does that sound reasonable?

It sounds great.

Is the ability to "upgrade" an index enough?  I am thinking of cases of
radical reconstructive surgery.



Lucene Index comparison..

Krishnan, Ananda
In reply to this post by Grant Ingersoll
Hi

Can anyone please tell me how to compare two different Lucene indexes?

Thanks and regards,
Anandh.


Re: Lucene Index comparison..

Karl Wettin-3
On Thu, 2006-05-11 at 10:32 +0530, Krishnan, Ananda wrote:
> Hi
>
> Can anyone please help me to know about how to compare two different lucene indexes.

I think there have been four instances of this question lately, including
my own.

Mine is only compatible with the 1.9_20060505-karl1 branch, and is really
more of a test case. I will add it to my next update of
<http://issues.apache.org/jira/browse/LUCENE-550>. I can send you the
broken snapshot off list if you want.

It is supplied with two index readers and
 * iterates over and compares all terms, documents and positions.
 * runs a couple of searches and compares the results.
 * does not compare the term frequency vectors.
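
The first bullet can be sketched without any Lucene classes at all. The example below is not the LUCENE-550 code; it assumes each index has already been dumped into a sorted map of term -> document number -> positions (a made-up in-memory model), and IndexDiff, add, and firstDifference are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch only: compares two in-memory postings dumps, not real index readers.
final class IndexDiff {

    /** Records the positions of a term in a document (helper for building test data). */
    static void add(SortedMap<String, SortedMap<Integer, List<Integer>>> index,
                    String term, int doc, Integer... positions) {
        index.computeIfAbsent(term, t -> new TreeMap<>()).put(doc, Arrays.asList(positions));
    }

    /** Returns a description of the first difference, or null if the postings match. */
    static String firstDifference(SortedMap<String, SortedMap<Integer, List<Integer>>> a,
                                  SortedMap<String, SortedMap<Integer, List<Integer>>> b) {
        if (!a.keySet().equals(b.keySet())) {
            return "term sets differ";
        }
        for (String term : a.keySet()) {
            SortedMap<Integer, List<Integer>> docsA = a.get(term);
            SortedMap<Integer, List<Integer>> docsB = b.get(term);
            if (!docsA.keySet().equals(docsB.keySet())) {
                return "documents differ for term " + term;
            }
            for (Integer doc : docsA.keySet()) {
                if (!docsA.get(doc).equals(docsB.get(doc))) {
                    return "positions differ for term " + term + " in doc " + doc;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        SortedMap<String, SortedMap<Integer, List<Integer>>> left = new TreeMap<>();
        SortedMap<String, SortedMap<Integer, List<Integer>>> right = new TreeMap<>();
        add(left, "lucene", 0, 1, 5);
        add(right, "lucene", 0, 1, 6);
        System.out.println(firstDifference(left, right));
    }
}
```

Comparing search results, the second bullet, would need real readers and searchers and is left out of this sketch.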




Re: Taking a step back

Doug Cutting
In reply to this post by Bill Janssen
Bill Janssen wrote:
> Have you put a field in the file format yet that gives its version?
> Alternatively, is there a way to find out which version of Lucene
> needs to be used with a given index?

The segments file has a format number, as do many other files, but the
segments file is the only file global to an index with a format number.
The segments format number is in the first 32 bits of the file.  If that
value is non-negative, it indicates format zero; otherwise the format
number is the negated value.
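
A minimal sketch of that decoding rule, assuming the 32-bit value is stored high-order byte first (big-endian). The class SegmentsFormat and its methods are hypothetical names for this example, and the sketch works on a byte array rather than on a real segments file.

```java
import java.nio.ByteBuffer;

// Sketch of the rule described above; not Lucene code.
final class SegmentsFormat {

    /** Non-negative means format zero; otherwise the format number is the negated value. */
    static int formatNumber(int first32Bits) {
        return first32Bits >= 0 ? 0 : -first32Bits;
    }

    /** Decodes the format from the first four bytes, assumed to be big-endian. */
    static int readFormat(byte[] fileStart) {
        // ByteBuffer reads big-endian by default.
        return formatNumber(ByteBuffer.wrap(fileStart, 0, 4).getInt());
    }

    public static void main(String[] args) {
        byte[] minusOne = {(byte) 0xFF, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF};
        byte[] seven = {0, 0, 0, 7};
        System.out.println(readFormat(minusOne)); // stored value -1
        System.out.println(readFormat(seven));    // stored value 7, non-negative
    }
}
```

Note that this yields the format number of the segments file only; as the next paragraph says, it does not identify which Lucene release wrote the index.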

It sounds like you'd like something that, given an index, will tell you
what version of Lucene wrote it.  That would indeed be nice, but does
not yet exist.  Please contribute a patch if this is very important to
you, or at least submit a bug report and pray that someone else patches
it for you.

Doug


Re: Taking a step back

Marvin Humphrey
In reply to this post by Grant Ingersoll

On May 10, 2006, at 4:05 AM, Grant Ingersoll wrote:

> As I see it, we have several people proposing file format changes,  
> Otis and some others want scoring changes, I have discussed with a  
> few people the ability to make more pluggable how fields are  
> indexed plus the ability to add metadata at all levels of the index  
> (field, document, index, etc.), more to come on this soon.  
> Additionally, the lazy loading field stuff is pending and would  
> benefit from a few file format changes as well.

http://en.wikipedia.org/wiki/Feeping_creaturism

The pressure is accumulating.

If Lucene is to stay usable and maintainable, some people, sometimes,
have to be told "thank you, but no".

Not that I'm volunteering.  ;)

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



Re: Taking a step back

Marvin Humphrey
In reply to this post by Robert Engels

On May 10, 2006, at 8:02 AM, Robert Engels wrote:

> The file format issue, however, is a non-issue. If you want
> interoperability between systems, do it via remote invocation and IIOP,
> or some HTTP interface. This is far easier to control, especially
> through version change cycles - otherwise all platforms need to be
> updated together, which is very hard to do (unless you are using Java
> with WORA!).
>
> I also don't understand why Lucene doesn't focus on being THE JAVA
> search engine. I think anything that detracts from moving that forward
> should be out of scope.

I really don't relish the prospect that this might degenerate into a  
language argument, but I think it falls to me to respond, since the  
patch I submitted on Monday opens up a lot of possibilities for interop.

I don't necessarily disagree.

Abandoning all attempts at interop has its advantages.  One  
unfortunate albeit unavoidable aspect of Lucene is that it is tightly  
bound to its file format.  In a perfect world, the file reading/
writing apparatus would be modular: the index would be read into  
memory using a plugin, manipulated, then saved using another plugin.  
That doesn't work, obviously, because indexes are commonly too large  
to be read into available RAM, and so the I/O stuff is scattered over  
the entire library, which makes maintaining compatibility laborious.

However, Lucene has to make some effort to track its file format  
definition, so that it may live up to the commitments for backwards-
compatibility codified earlier in this thread.  This is currently  
done using the File Formats document (though that document is  
incomplete and buggy).  There's not much difference between  
supporting the files written by an earlier version of Lucene and  
supporting the files written by another implementation of Lucene  
which adhere to the same spec.

The only question is whether there are Java-specific optimizations  
which are so advantageous that they outweigh the benefits of  
interchange.  There is no inherent advantage in using Modified UTF-8  
over standard UTF-8, and the UTF-8 code I supplied actually speeds up  
Lucene by a couple percent because it simplifies some conditionals --  
all of the performance hit comes from using a bytecount as the String  
prefix.  I have good reasons to believe that this can go away, not  
the least of which is I've actually written a working implementation  
in Perl/C which uses bytecounts and I know where all the bottlenecks  
are.
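
To make the difference between the two encodings concrete, here is a small sketch. It uses the JDK's DataOutputStream.writeUTF as a stand-in for Modified UTF-8 (it is not Lucene's own string-writing code), and the class and method names are made up for the example. It also shows why a character count and a byte count used as a String prefix are different numbers.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Sketch comparing encoded lengths; Utf8Flavors is a hypothetical name.
final class Utf8Flavors {

    /** Encoded length in Java's Modified UTF-8, as produced by writeUTF. */
    static int modifiedUtf8Length(String s) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size() - 2; // writeUTF prepends a two-byte length
    }

    /** Encoded length in standard UTF-8. */
    static int standardUtf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String nul = "\0";                                   // U+0000
        String clef = new String(Character.toChars(0x1D11E)); // outside the BMP

        // U+0000: two bytes in Modified UTF-8, one byte in standard UTF-8.
        System.out.println(modifiedUtf8Length(nul) + " vs " + standardUtf8Length(nul));
        // U+1D11E: two 3-byte surrogates (6 bytes) vs one 4-byte sequence.
        System.out.println(modifiedUtf8Length(clef) + " vs " + standardUtf8Length(clef));
        // Prefix candidates: Java char count (2) vs standard UTF-8 byte count (4).
        System.out.println(clef.length() + " chars, " + standardUtf8Length(clef) + " bytes");
    }
}
```

For plain ASCII text the two encodings produce identical bytes; they diverge only for U+0000 and for characters outside the Basic Multilingual Plane.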

There are also advantages to keeping the file format public, both for
Java Lucene and for the larger Apache Lucene project.  Of course
there's the raw usefulness of interchange.  For instance, it
might be nice to whip up a little script in Perl or Ruby which works
with your existing rig -- especially if there's a CPAN module that
offers functionality you need which isn't available yet in Java, or
you'd benefit from a near-instantaneous startup time.

But more important, I'd argue, is that having all implementations  
share a common file format means that all the authors have an  
amplified interest in coordinating, communicating, and contributing.  
Just as learning new languages, programming or natural, broadens an  
individual's horizons, so does working out an implementation based on  
Lucene's data structures in another language lead to fresh thinking.  
The more cross-pollination of ideas from various authors and by  
proxy, their extended communities, the more all of the sub-projects  
gain and the faster Apache Lucene as a whole advances.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



Re: Taking a step back

Doug Cutting
In reply to this post by Marvin Humphrey
Marvin Humphrey wrote:
> If Lucene is to stay usable and maintainable, some people, some  times,
> have to be told "thank you, but no".

There's lots of talk.  Fortunately, in this case, the talk to patch
ratio is typically high in open source projects.

Doug


Re: Taking a step back

Doug Cutting
In reply to this post by Marvin Humphrey
Marvin Humphrey wrote:
> The only question is whether there are Java-specific optimizations  
> which are so advantageous that they outweigh the benefits of  
> interchange.

It's not just optimizations.  If we, e.g., wrote, for each field, the
name of the codec class that it uses, then we could provide arbitrary
extensibility.  Anything that implemented the field codec API could be
used, permitting alternate posting compression algorithms, etc.  But
that would not be friendly to other implementations, which may not be
able to easily instantiate classes from class names, nor dynamically
download codec implementations from a public repository, etc.  The fact
that Java bytecode is portable makes this more attractive.
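
A minimal sketch of what instantiating a codec from a stored class name might look like in Java. No field codec API exists in the thread, so the FieldCodec interface, the OneBytePerPosting implementation, and the load method are all assumptions made up for this illustration.

```java
// Sketch of loading a per-field codec by class name; all names are hypothetical.
final class CodecByName {

    /** Hypothetical field codec API. */
    interface FieldCodec {
        byte[] encode(int[] postings);
    }

    /** Trivial codec that writes one byte per posting (illustration only). */
    static final class OneBytePerPosting implements FieldCodec {
        public OneBytePerPosting() {}

        @Override
        public byte[] encode(int[] postings) {
            byte[] out = new byte[postings.length];
            for (int i = 0; i < postings.length; i++) {
                out[i] = (byte) postings[i]; // truncates values above 127
            }
            return out;
        }
    }

    /** Instantiates the codec whose class name was stored with the field. */
    static FieldCodec load(String className) {
        try {
            return (FieldCodec) Class.forName(className).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("cannot load codec " + className, e);
        }
    }

    public static void main(String[] args) {
        // The name that would be written into the index for this field.
        String stored = OneBytePerPosting.class.getName();
        FieldCodec codec = load(stored);
        System.out.println(stored + " -> " + codec.encode(new int[] {1, 2, 3}).length + " bytes");
    }
}
```

The reflective lookup in load is the step a non-Java implementation would have trouble reproducing, which is the interoperability cost described above.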

Doug


RE: Taking a step back

Robert Engels
In reply to this post by Marvin Humphrey
I disagree with that a bit. I have found that certain languages lend
themselves far better to certain file formats (that is, if an operation is
very efficient to perform in a particular language, using a file format that
allows the usage of that operation directly will often lead to much better
performance). This is often true with byte ordering on particular hardware
platforms. That is the whole reason this is an issue. Others can read the
modified UTF-8; it is just not as efficient for them!
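
The byte-ordering point can be illustrated with a short sketch: the same 32-bit integer serialized in the two common byte orders. ByteOrderDemo and encode are names invented for the example, and no claim is made here about how much the conversion actually costs on any given platform.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Sketch: one int, two on-disk layouts.
final class ByteOrderDemo {

    /** Serializes a 32-bit value in the requested byte order. */
    static byte[] encode(int value, ByteOrder order) {
        return ByteBuffer.allocate(4).order(order).putInt(value).array();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(encode(1, ByteOrder.BIG_ENDIAN)));
        System.out.println(Arrays.toString(encode(1, ByteOrder.LITTLE_ENDIAN)));
        // The order the current hardware uses natively.
        System.out.println(ByteOrder.nativeOrder());
    }
}
```

A file format has to pick one of these layouts, so readers on hardware with the other native order must swap bytes on every access.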

But more importantly, I don't think Lucene (or others) should be "held back"
attempting to adhere to a standardized file format.

Take databases for example. Many available. All use different file formats,
but all can be accessed with (pretty much) standardized SQL (using different
drivers).

I think Lucene could offer a similar approach at the API level, maybe an
embedded TCP/IP interface / command processor (similar to an HTTP server).

You are always going to have interoperability issues (sometimes even when
using Java, but rarely), so I say dump the burden on the others, and just
make Lucene the best Java search engine possible.

Without starting some sort of flame war, I can't think of any advantages to
not running a Java version of Lucene, but that is just my opinion. It would
be fairly straightforward to convert all of Lucene to C and provide a Java
binding, but why???



-----Original Message-----
From: Marvin Humphrey [mailto:[hidden email]]
Sent: Thursday, May 11, 2006 12:08 PM
To: [hidden email]
Subject: Re: Taking a step back



RE: Taking a step back

Robert Engels
In reply to this post by Doug Cutting
Exactly. If people don't get the REAL value of Java by now, they are
probably not going to ever get it. Weighing ALL of the pros/cons, developing
modern software in anything else is just silly. But, arguing this is akin to
discussing religion...

-----Original Message-----
From: Doug Cutting [mailto:[hidden email]]
Sent: Thursday, May 11, 2006 12:20 PM
To: [hidden email]
Subject: Re: Taking a step back

