Re: [Performance] Streaming main memory indexing of single strings

Wolfgang Hoschek
I've uploaded slightly improved versions of the fast MemoryIndex  
contribution to http://issues.apache.org/bugzilla/show_bug.cgi?id=34585 
along with another contrib - PatternAnalyzer.
 
For a quick overview without downloading code, there's javadoc for it all at
http://dsd.lbl.gov/nux/api/org/apache/lucene/index/memory/package-summary.html

I'm happy to maintain these classes externally as part of the Nux  
project. But from the preliminary discussion on the list some time ago  
I gathered there'd be some wider interest, hence I prepared the  
contribs for the community. What would be the next steps for taking  
this further, if any?

Thanks,
Wolfgang.

/**
 * Efficient Lucene analyzer/tokenizer that preferably operates on a String
 * rather than a {@link java.io.Reader}, that can flexibly separate on a
 * regular expression {@link Pattern} (with behaviour identical to
 * {@link String#split(String)}), and that combines the functionality of
 * {@link org.apache.lucene.analysis.LetterTokenizer},
 * {@link org.apache.lucene.analysis.LowerCaseTokenizer},
 * {@link org.apache.lucene.analysis.WhitespaceTokenizer},
 * {@link org.apache.lucene.analysis.StopFilter} into a single efficient
 * multi-purpose class.
 * <p>
 * If you are unsure what exactly a regular expression should look like,
 * consider prototyping by simply trying various expressions on some test
 * texts via {@link String#split(String)}. Once you are satisfied, give that
 * regex to PatternAnalyzer. Also see <a target="_blank"
 * href="http://java.sun.com/docs/books/tutorial/extra/regex/">Java Regular
 * Expression Tutorial</a>.
 * <p>
 * This class can be considerably faster than the "normal" Lucene tokenizers.
 * It can also serve as a building block in a compound Lucene
 * {@link org.apache.lucene.analysis.TokenFilter} chain. For example as in
 * this stemming example:
 * <pre>
 * PatternAnalyzer pat = ...
 * TokenStream tokenStream = new SnowballFilter(
 *     pat.tokenStream("content", "James is running round in the woods"),
 *     "English");
 * </pre>
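
The prototyping workflow the javadoc recommends (try a regex through split, then lower-case and drop stop words) can be tried with the JDK alone. The following is only a minimal sketch of that idea and not PatternAnalyzer's actual implementation; the class name SplitPrototype, the NON_LETTERS pattern and the stop word list are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical helper for prototyping a separator regex; not part of the contrib. */
public class SplitPrototype {

    // Assumed separator: one or more characters that are not letters.
    static final Pattern NON_LETTERS = Pattern.compile("[^\\p{L}]+");

    /** Splits text on the separator, optionally lower-cases, and drops stop words. */
    static List<String> tokenize(String text, Pattern separator,
                                 boolean toLowerCase, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        for (String token : separator.split(text)) {
            // split() yields a leading empty string when the text starts with a separator
            if (token.isEmpty()) continue;
            if (toLowerCase) token = token.toLowerCase(Locale.ROOT);
            if (!stopWords.contains(token)) tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("is", "in", "the"));
        // prints [james, running, round, woods]
        System.out.println(tokenize("James is running round in the woods",
                                    NON_LETTERS, true, stopWords));
    }
}
```

Once a pattern behaves as wanted on sample text like this, the same regex would be the one handed to PatternAnalyzer.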



On Apr 22, 2005, at 1:53 PM, Wolfgang Hoschek wrote:

> I've now got the contrib code cleaned up, tested and documented into a  
> decent state, ready for your review and comments.
> Consider this a formal contrib (Apache license is attached).
>
> The relevant files are attached to the following bug ID:
>
> http://issues.apache.org/bugzilla/show_bug.cgi?id=34585
>
> For a quick overview without downloading code, there's some javadoc at
> http://dsd.lbl.gov/nux/api/org/apache/lucene/index/memory/package-summary.html
>
> There are several small open issues listed in the javadoc and also  
> inside the code. Thoughts? Comments?
>
> I've also got small performance patches for various parts of Lucene  
> core (not submitted yet). Taken together they lead to substantially  
> improved performance for MemoryIndex, and most likely also for Lucene  
> in general. Some of them are more involved than others. I'm now  
> figuring out how much performance each of these contributes and how to  
> propose potential integration - stay tuned for some follow-ups to  
> this.
>
> The code as submitted would certainly benefit a lot from said patches,  
> but they are not required for correct operation. It should work out of  
> the box (currently only on 1.4.3 or lower). Try running
>
> cd lucene-cvs
> java org.apache.lucene.index.memory.MemoryIndexTest
>
> with or without custom arguments to see it in action.
>
> Before turning to a performance patch discussion, at this point I'm
> most interested in folks giving it a spin, comments on the API, or any
> other issues.
>
> Cheers,
> Wolfgang.
>
> On Apr 20, 2005, at 11:26 AM, Wolfgang Hoschek wrote:
>
>> On Apr 20, 2005, at 9:22 AM, Erik Hatcher wrote:
>>
>>>
>>> On Apr 20, 2005, at 12:11 PM, Wolfgang Hoschek wrote:
>>>> By the way, by now I have a version against 1.4.3 that is 10-100  
>>>> times faster (i.e. 30000 - 200000 index+query steps/sec) than the  
>>>> simplistic RAMDirectory approach, depending on the nature of the  
>>>> input data and query. From some preliminary testing it returns  
>>>> exactly what RAMDirectory returns.
>>>
>>> Awesome.  Using the basic StringIndexReader I sent?
>>
>> Yep, it's loosely based on the empty skeleton you sent.
>>
>>>
>>> I've been fiddling with it a bit more to get other query types.
>>> I'll add it to the contrib area when it's a bit more robust.
>>
>> Perhaps we could merge up once I'm ready and put that into the  
>> contrib area? My version now supports tokenization with any analyzer  
>> and it supports any arbitrary Lucene query. I might make the API for  
>> adding terms a little more general, perhaps allowing arbitrary  
>> Document objects if that's what other folks really need...
>>
>>>
>>>> As an aside, is there any work going on to potentially support  
>>>> prefix (and infix) wild card queries ala "*fish"?
>>>
>>> WildcardQuery supports wildcard characters anywhere in the string.  
>>> QueryParser itself restricts expressions that have leading wildcards  
>>> from being accepted.
>>
>> Any particular reason for this restriction? Is this simply a current  
>> parser limitation or something inherent?
>>
>>> QueryParser supports wildcard characters in the middle of strings no  
>>> problem though.  Are you seeing otherwise?
>>
>> I meant an infix query such as "*fish*"
>>
>> Wolfgang.
>>
>>
>> -----------------------------------------------------------------------
>> Wolfgang Hoschek                  |   email: [hidden email]
>> Distributed Systems Department    |   phone: (415)-533-7610
>> Berkeley Laboratory               |   http://dsd.lbl.gov/~hoschek/
>> -----------------------------------------------------------------------
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: [Performance] Streaming main memory indexing of single strings

Erik Hatcher
Wolfgang,

You have provided a superb set of patches!  I'm in awe of the extensive  
documentation you've done.

There is nothing further you need to do, but be patient while we  
incorporate it into the contrib area somewhere.  Your PatternAnalyzer  
could fit into the contrib/analyzers area nicely.  I'm not quite sure  
where to put MemoryIndex - maybe it deserves to stand on its own in a  
new contrib area?  Or does it make sense to put this into misc (still  
in sandbox/misc)?  Or where?

        Erik


Re: [Performance] Streaming main memory indexing of single strings

Doug Cutting
Erik Hatcher wrote:
> I'm not quite sure  
> where to put MemoryIndex - maybe it deserves to stand on its own in a  
> new contrib area?

That sounds good to me.

> Or does it make sense to put this into misc (still  
> in sandbox/misc)?  Or where?

Isn't the goal for sandbox/ to go away, replaced with contrib/?

Doug


Re: [Performance] Streaming main memory indexing of single strings

Wolfgang Hoschek
Whichever place you settle on is fine with me.

[In case it might make a difference: Just note that MemoryIndex has a
small auxiliary dependency on PatternAnalyzer in addField() because the
Analyzer superclass doesn't have a tokenStream(String fieldName, String
text) method. And PatternAnalyzer requires JDK 1.4 or higher]

Wolfgang.


Re: [Performance] Streaming main memory indexing of single strings

Erik Hatcher
In reply to this post by Doug Cutting
On Apr 27, 2005, at 12:22 PM, Doug Cutting wrote:
> Erik Hatcher wrote:
>> I'm not quite sure  where to put MemoryIndex - maybe it deserves to
>> stand on its own in a  new contrib area?
>
> That sounds good to me.

Ok... once Wolfgang gives me one last round of updates (JUnit tests
instead of main() and upgrade it to work with trunk) I'll do that.  I
had put it in miscellaneous but will create its own sub-contrib area
instead.

>
>> Or does it make sense to put this into misc (still  in sandbox/misc)?
>>  Or where?
>
> Isn't the goal for sandbox/ to go away, replaced with contrib/?

Yes.  In fact, I moved the last relevant piece
(sandbox/contributions/miscellaneous) to contrib last night.   I think
both the parsers and XML-Indexing-Demo found in the sandbox are not
worth preserving.  Anyone feel that these pieces left in the sandbox
should be preserved?

        Erik



Re: [Performance] Streaming main memory indexing of single strings

Wolfgang Hoschek
OK. I'll send an update as soon as I get round to it...
Wolfgang.


Re: [Performance] Streaming main memory indexing of single strings

Wolfgang Hoschek
In reply to this post by Erik Hatcher
I've uploaded code that now runs against the current SVN, plus junit
test cases, plus some minor internal updates to the functionality
itself. For details see
http://issues.apache.org/bugzilla/show_bug.cgi?id=34585

Be prepared for the testcases to take some minutes to complete - don't
hit CTRL-C :-)
Erik, if nobody objects, can you please put this into a contrib area,
e.g. module "memory" in org.apache.lucene.index.memory, or similar?
Thanks,
Wolfgang.


Re: [Performance] Streaming main memory indexing of single strings

Erik Hatcher

On May 1, 2005, at 10:20 PM, Wolfgang Hoschek wrote:

> I've uploaded code that now runs against the current SVN, plus junit  
> test cases, plus some minor internal updates to the functionality  
> itself. For details see  
> http://issues.apache.org/bugzilla/show_bug.cgi?id=34585
>
> Be prepared for the testcases to take some minutes to complete - don't  
> hit CTRL-C :-)
> Erik, if nobody objects, can you please put this into a contrib area,  
> e.g. module "memory" in org.apache.lucene.index.memory, or similar?

I have committed it into contrib/memory.  I made a few minor tweaks  
such as 2005 for year in license header, putting package statement  
above license, and adjusting the paths in the test case to match our  
standard src/test and src/java structure.

The test case is failing (type "ant test" at the contrib/memory working  
directory) with this:

     [junit] Testcase: testMany(org.apache.lucene.index.memory.MemoryIndexTest): Caused an ERROR
     [junit] BUG DETECTED:69 at query=term AND NOT phrase term, file=src/java/org/apache/lucene/index/memory/MemoryIndex.java, anal=org.apache.lucene.analysis.SimpleAnalyzer@127b52
     [junit] java.lang.IllegalStateException: BUG DETECTED:69 at query=term AND NOT phrase term, file=src/java/org/apache/lucene/index/memory/MemoryIndex.java, anal=org.apache.lucene.analysis.SimpleAnalyzer@127b52
     [junit]     at org.apache.lucene.index.memory.MemoryIndexTest.run(MemoryIndexTest.java:305)
     [junit]     at org.apache.lucene.index.memory.MemoryIndexTest.testMany(MemoryIndexTest.java:228)

Your conversion to a JUnit test case was not quite what I had in mind  
:)  You simply wrapped your main() into a testMany method.  But it is  
fine for now as it is easily converted into more granular testXXX  
methods that use the JUnit assert* methods.  The paths to test files  
will likely need to be parameterized and passed in from Ant's <junit>  
task via system properties in order to run correctly regardless of  
working directory.  These things are easily tweaked though and not  
worth holding back the initial commit.
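
One way to make the test file paths independent of the working directory, as suggested above, is to resolve them against a base directory read from a system property. The following is only a sketch: the class name TestDataLocator and the property name memory.test.datadir are hypothetical, since the thread does not name either.

```java
import java.io.File;

/** Hypothetical helper that resolves test files against a configurable base directory. */
public class TestDataLocator {

    // Assumed property name; the build would have to pass the same key.
    static final String DATA_DIR_PROPERTY = "memory.test.datadir";

    /** Resolves a relative path against the configured base directory (default: current directory). */
    static File resolve(String relativePath) {
        String baseDir = System.getProperty(DATA_DIR_PROPERTY, ".");
        return new File(baseDir, relativePath);
    }

    public static void main(String[] args) {
        // With no property set this resolves against the current working directory.
        System.out.println(resolve("src/java/org/apache/lucene/index/memory/MemoryIndex.java"));
    }
}
```

On the Ant side, the property could be supplied through a nested sysproperty element inside the junit task, so that the same test runs from any directory.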

Again, I'm impressed with your level of javadocs and thoroughness in  
the code.  Good stuff!

        Erik



Re: [Performance] Streaming main memory indexing of single strings

Wolfgang Hoschek
I'm looking at it right now. The tests pass fine when you put  
lucene-1.4.3.jar instead of the current lucene onto the classpath which  
is what I've been doing so far. Something seems to have changed in the  
scoring calculation. No idea what that might be. I'll see if I can find  
out.

Wolfgang.


Re: [Performance] Streaming main memory indexing of single strings

Wolfgang Hoschek
This is what I have as scoring calculation, and it seems to do exactly  
what lucene-1.4.3 does because the tests pass.

    public byte[] norms(String fieldName) {
        if (DEBUG) System.err.println("MemoryIndexReader.norms: " + fieldName);
        Info info = getInfo(fieldName);
        int numTokens = info != null ? info.numTokens : 0;
        byte norm = Similarity.encodeNorm(
            getSimilarity().lengthNorm(fieldName, numTokens));
        return new byte[] {norm};
    }

    public void norms(String fieldName, byte[] bytes, int offset) {
        if (DEBUG) System.err.println("MemoryIndexReader.norms: " + fieldName + "*");
        byte[] norms = norms(fieldName);
        System.arraycopy(norms, 0, bytes, offset, norms.length);
    }

    private Similarity getSimilarity() {
        return searcher.getSimilarity(); // this is the normal lucene IndexSearcher
    }
               

Can anyone see what's wrong with it under the current Lucene SVN trunk?
Should my calculation now be done differently? If so, how?
Thanks for any pointers in the right direction.
Wolfgang.
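
For readers following the norms code above, here is a small self-contained sketch of the arithmetic involved. It assumes the default similarity's length norm is 1/sqrt(numTokens), and it leaves out the byte encoding done by Similarity.encodeNorm entirely; the class and method names are hypothetical and nothing here calls Lucene.

```java
/** Hypothetical sketch of the length-norm arithmetic for a single-document index. */
public class LengthNormSketch {

    // Assumption: the default length norm is 1/sqrt(numTokens).
    // For numTokens == 0 this sketch returns positive infinity.
    static float lengthNorm(int numTokens) {
        return (float) (1.0 / Math.sqrt(numTokens));
    }

    // Mirrors the two-argument norms(...) method: copy the one-element
    // norms array for the single document into a caller-supplied buffer.
    static void copyNorms(byte[] norms, byte[] dest, int offset) {
        System.arraycopy(norms, 0, dest, offset, norms.length);
    }

    public static void main(String[] args) {
        System.out.println(lengthNorm(4)); // prints 0.5
    }
}
```

Because a MemoryIndex holds at most one document, the norms array for a field has exactly one element, which is why the copy moves a single byte.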


Re: [Performance] Streaming main memory indexing of single strings

Wolfgang Hoschek
In reply to this post by Wolfgang Hoschek
Finally found and fixed the bug!
The fix is simply to replace MemoryIndex.MemoryIndexReader skipTo()  
with the following:

    public boolean skipTo(int target) {
        if (DEBUG) System.err.println(".skipTo: " + target);
        return next();
    }

Apparently lucene-1.4.3 didn't use skipTo() in a way that triggered the  
bug, while SVN does.

I now ran the tests over a much larger set of documents and all tests  
pass. Give it a shot :-)
Wolfgang.



Re: [Performance] Streaming main memory indexing of single strings

Paul Elschot
Wolfgang,

On Monday 02 May 2005 23:21, Wolfgang Hoschek wrote:

> Finally found and fixed the bug!
> The fix is simply to replace MemoryIndex.MemoryIndexReader skipTo()  
> with the following:
>
>     public boolean skipTo(int target) {
>         if (DEBUG) System.err.println(".skipTo: " + target);
>         return next();
>     }
>
> Apparently lucene-1.4.3 didn't use skipTo() in a way that triggered the  
> bug, while SVN does.

Yes, the svn trunk uses skipTo more often than 1.4.3.

However, your implementation of skipTo() needs some improvement.
See the javadoc of skipTo of class Scorer:

http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Scorer.html#skipTo(int)

In case the underlying scorers provide skipTo() it's even better to use that.

Regards,
Paul Elschot



Re: [Performance] Streaming main memory indexing of single strings

Wolfgang Hoschek
> Yes, the svn trunk uses skipTo more often than 1.4.3.
>
> However, your implementation of skipTo() needs some improvement.
> See the javadoc of skipTo of class Scorer:
>
> http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Scorer.html#skipTo(int)

What's wrong with the version I sent? Remember that there can be at most
one document in a MemoryIndex. So the "target" parameter can safely be
ignored, as far as I can see.

>
> In case the underlying scorers provide skipTo() it's even better to  
> use that.
>

The version I sent returns in O(1), if performance was your concern. Or  
did you mean something else?

Wolfgang.



Re: [Performance] Streaming main memory indexing of single strings

Paul Elschot
On Monday 02 May 2005 23:38, Wolfgang Hoschek wrote:

> > Yes, the svn trunk uses skipTo more often than 1.4.3.
> >
> > However, your implementation of skipTo() needs some improvement.
> > See the javadoc of skipTo of class Scorer:
> >
> > http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Scorer.html#skipTo(int)
>
> What's wrong with the version I sent? Remember that there can be at most
> one document in a MemoryIndex. So the "target" parameter can safely be  
> ignored, as far as I can see.

Correct, I did not realize that there is only a single doc in the index.

>
> >
> > In case the underlying scorers provide skipTo() it's even better to  
> > use that.
> >
>
> The version I sent returns in O(1), if performance was your concern. Or  
> did you mean something else?

Since 0 is the only document number in the index, a

return target == 0;

might be nice for skipTo(). It doesn't really help performance, though,
and the next() works just as well.

Regards,
Paul Elschot.


Re: [Performance] Streaming main memory indexing of single strings

Wolfgang Hoschek
>>
>> The version I sent returns in O(1), if performance was your concern. Or
>> did you mean something else?
>
> Since 0 is the only document number in the index, a
>
> return target == 0;
>
> might be nice for skipTo(). It doesn't really help performance, though,
> and the next() works just as well.
>
> Regards,
> Paul Elschot.
>


It's not just "return target == 0". Internally next() switches a
hasNext flag to false, and that makes it a safer operation...
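
A minimal sketch of the difference, assuming a cursor over a single-document index. SingleDocCursor and statelessSkipTo are hypothetical names used only for illustration, not the actual MemoryIndex code.

```java
// Hypothetical single-document cursor; names are illustrative only.
class SingleDocCursor {
    private boolean hasNext = true; // true until the one document has been consumed

    /** Returns true exactly once, then false on every later call. */
    boolean next() {
        boolean result = hasNext;
        hasNext = false;
        return result;
    }

    /** Stateful variant: target is ignored because document 0 is the only candidate. */
    boolean skipTo(int target) {
        return next();
    }

    /** Stateless variant: never marks the document as consumed. */
    boolean statelessSkipTo(int target) {
        return target == 0;
    }

    public static void main(String[] args) {
        SingleDocCursor c = new SingleDocCursor();
        System.out.println(c.skipTo(0));          // true: the document is consumed
        System.out.println(c.skipTo(0));          // false: nothing left
        System.out.println(c.statelessSkipTo(0)); // true
        System.out.println(c.statelessSkipTo(0)); // true again, the cursor never ends
    }
}
```

A caller that loops until skipTo() returns false terminates with the stateful variant, while the stateless one keeps reporting a match for target 0.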

BTW, did you give the unit tests a shot? Or even better, run it against
some of your own queries/test data? That might help to shake out other
bugs that might potentially be lurking in remote corners...

Cheers,
Wolfgang.


Re: [Performance] Streaming main memory indexing of single strings

Erik Hatcher
In reply to this post by Wolfgang Hoschek

On May 2, 2005, at 5:21 PM, Wolfgang Hoschek wrote:

> Finally found and fixed the bug!
> The fix is simply to replace MemoryIndex.MemoryIndexReader skipTo()  
> with the following:
>
>                 public boolean skipTo(int target) {
>                     if (DEBUG) System.err.println(".skipTo: " + target);
>                     return next();
>                 }
>
> Apparently lucene-1.4.3 didn't use skipTo() in a way that triggered  
> the bug, while SVN does.

I've committed this change after it successfully worked for me.

Thanks!

     Erik


Re: [Performance] Streaming main memory indexing of single strings

Wolfgang Hoschek
Thanks!
Wolfgang.

> I've committed this change after it successfully worked for me.
>
> Thanks!
>
>     Erik
>


Re: [Performance] Streaming main memory indexing of single strings

Wolfgang Hoschek
Here's a performance patch for MemoryIndex.MemoryIndexReader that
caches the norms for a given field, avoiding repeated recomputation of
the norms. Recall that, depending on the query, norms() can be called
over and over again with mostly the same parameters. Thus, replace
public byte[] norms(String fieldName) with the following code:

                /** Performance hack: cache norms to avoid repeated expensive calculations. */
                private byte[] cachedNorms;
                private String cachedFieldName;
                private Similarity cachedSimilarity;

                public byte[] norms(String fieldName) {
                        byte[] norms = cachedNorms;
                        Similarity sim = getSimilarity();
                        // Identity comparison is cheap; a miss on an equal but distinct
                        // String only costs one recomputation.
                        if (fieldName != cachedFieldName || sim != cachedSimilarity) { // not cached?
                                Info info = getInfo(fieldName);
                                int numTokens = info != null ? info.numTokens : 0;
                                float n = sim.lengthNorm(fieldName, numTokens);
                                byte norm = Similarity.encodeNorm(n);
                                norms = new byte[] {norm};

                                // Remember the result for the next call with the same arguments.
                                cachedNorms = norms;
                                cachedFieldName = fieldName;
                                cachedSimilarity = sim;
                                if (DEBUG) System.err.println("MemoryIndexReader.norms: "
                                                + fieldName + ":" + n + ":" + norm + ":" + numTokens);
                        }
                        return norms;
                }


The effect can be substantial when measured with the profiler, so it's
worth it.
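
To see the single-entry cache in isolation, here is a rough standalone sketch without any Lucene dependency. Everything in it is an assumption made for illustration: NormCacheSketch, put() and the computations counter are hypothetical, and the 1/sqrt(numTokens) formula with the times-127 byte encoding is only a placeholder for Similarity.lengthNorm() and Similarity.encodeNorm().

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the reader; it only demonstrates the caching pattern.
class NormCacheSketch {
    private final Map<String, Integer> tokenCounts = new HashMap<>();
    int computations = 0;       // counts how often the norm is actually recomputed

    private String cachedField; // single-entry cache: last field name seen
    private byte[] cachedNorms; // norms computed for cachedField

    NormCacheSketch put(String field, int numTokens) {
        tokenCounts.put(field, numTokens);
        return this;
    }

    byte[] norms(String field) {
        // Identity check, as in the patch: a false miss only costs one recomputation.
        if (field != cachedField) {
            int numTokens = tokenCounts.getOrDefault(field, 0);
            // Placeholder formula and encoding, NOT Lucene's Similarity.
            float n = numTokens == 0 ? 0f : (float) (1.0 / Math.sqrt(numTokens));
            cachedNorms = new byte[] {(byte) Math.round(n * 127)};
            cachedField = field;
            computations++;
        }
        return cachedNorms;
    }

    public static void main(String[] args) {
        NormCacheSketch reader = new NormCacheSketch().put("body", 4);
        reader.norms("body");
        reader.norms("body"); // same literal, same reference: served from the cache
        System.out.println("computations: " + reader.computations);
    }
}
```

Repeated calls with the same field name reference hit the cache. A call with an equal but distinct String instance falls through and recomputes, which costs time but still returns the right value.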
Wolfgang.


Re: [Performance] Streaming main memory indexing of single strings

Erik Hatcher
Applied!!

     Erik

On May 3, 2005, at 1:31 PM, Wolfgang Hoschek wrote:

> Here's a performance patch for MemoryIndex.MemoryIndexReader that  
> caches the norms for a given field, avoiding repeated recomputation  
> of the norms. Recall that, depending on the query, norms() can be  
> called over and over again with mostly the same parameters. Thus,  
> replace public byte[] norms(String fieldName) with the following code:
>
>         /** performance hack: cache norms to avoid repeated expensive calculations */
>         private byte[] cachedNorms;
>         private String cachedFieldName;
>         private Similarity cachedSimilarity;
>
>         public byte[] norms(String fieldName) {
>             byte[] norms = cachedNorms;
>             Similarity sim = getSimilarity();
>             if (fieldName != cachedFieldName || sim != cachedSimilarity) { // not cached?
>                 Info info = getInfo(fieldName);
>                 int numTokens = info != null ? info.numTokens : 0;
>                 float n = sim.lengthNorm(fieldName, numTokens);
>                 byte norm = Similarity.encodeNorm(n);
>                 norms = new byte[] {norm};
>
>                 cachedNorms = norms;
>                 cachedFieldName = fieldName;
>                 cachedSimilarity = sim;
>                 if (DEBUG) System.err.println("MemoryIndexReader.norms: "
>                         + fieldName + ":" + n + ":" + norm + ":" + numTokens);
>             }
>             return norms;
>         }
>
>
> The effect can be substantial when measured with the profiler, so  
> it's worth it.
> Wolfgang.

