feedback: Indexing speed improvement lucene 2.2->2.3.1

17 messages

feedback: Indexing speed improvement lucene 2.2->2.3.1

Uwe Goetzke

This week I switched the lucene library version on one customer system.
The indexing time went down from 46m32s to 16m20s for the complete task,
including optimisation. Great job!
We index product catalogs from several suppliers; in this case, around
56,000 product groups and 360,000 products, including descriptions, were
indexed.

Regards

Uwe



-----------------------------------------------------------------------
Healy Hudson GmbH - D-55252 Mainz Kastel
Geschäftsführer Christian Konhäuser - Amtsgericht Wiesbaden HRB 12076

This email is confidential. If you are not the intended recipient, you must not disclose or use the information contained in it. If you have received this email in error, please tell us immediately by return email and delete the document.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: feedback: Indexing speed improvement lucene 2.2->2.3.1

Michael McCandless

That is a nice result -- thanks for reporting this, Uwe!

Mike

On Mar 1, 2008, at 3:45 AM, Uwe Goetzke wrote:

> This week I switched the lucene library version on one customer system.
> The indexing time went down from 46m32s to 16m20s for the complete task
> including optimisation. Great Job!


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: feedback: Indexing speed improvement lucene 2.2->2.3.1

Ivan Vasilev
In reply to this post by Uwe Goetzke
Hi Uwe,

Could you tell us which Analyzer you were using when you measured such a big
indexing speedup?
If you use StandardAnalyzer (which uses StandardTokenizer), that may be the
reason. See the second-to-last report in the thread "Indexing Speed: 2.3 vs
2.2 (real world numbers)". According to that reporter, Jake Mannix, this is
because StandardTokenizer now uses StandardTokenizerImpl, which is generated
by JFlex instead of JavaCC.
I am asking because I noticed a great speedup in adding documents to the
index in our system. We time this in debug mode: documents are now added 5
times faster!
At the same time, the total indexing process in our case improved only by
about 8%. As our system is very big and complex, I am wondering whether the
whole indexing process really is reduced so remarkably and our own system
causes the remaining slowdown, or whether Lucene does some work on the index
afterwards, such as merges or other optimizations, and that is why the total
process is not so much faster.

Best Regards,
Ivan



Uwe Goetzke wrote:

> This week I switched the lucene library version on one customer system.
> The indexing time went down from 46m32s to 16m20s for the complete task
> including optimisation. Great Job!




AW: feedback: Indexing speed improvement lucene 2.2->2.3.1

Uwe Goetzke
Hi Ivan,
No, we do not use StandardAnalyser or StandardTokenizer.

Most data is processed by
        fTextTokenStream = result = new org.apache.lucene.analysis.WhitespaceTokenizer(reader);
        result = new ISOLatin2AccentFilter(result); // ISOLatin1AccentFilter  modified that ö -> oe
        result = new org.apache.lucene.analysis.LowerCaseFilter(result);
        result = new org.apache.lucene.analysis.NGramStemFilter(result,2); //just a bigram tokenizer

We use our own queryparser. The bigramms are searched with a tolerant phrase query, scoring in a doc the greatest bigramms clusters covering the phrase token.

Best Regards

Uwe

-----Original Message-----
From: Ivan Vasilev [mailto:[hidden email]]
Sent: Friday, 21 March 2008 16:25
To: [hidden email]
Subject: Re: feedback: Indexing speed improvement lucene 2.2->2.3.1

Hi Uwe,

Could you tell us which Analyzer you were using when you measured such a big
indexing speedup?




Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1

Michael McCandless

Ivan, can you describe more about your application?

The overall time for indexing has gotten much faster in 2.3, but this
assumes that things like retrieving a document from its original source,
filtering it, etc., take minimal time. If you have an application where
most of the time is spent outside Lucene, then the 2.3 speedups won't
result in a very large speedup for your application.

Mike
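Mike's point can be made concrete with the two figures Ivan reported (document adding 5 times faster, total run only about 8% faster) via a back-of-the-envelope Amdahl's-law calculation. This sketch is the editor's arithmetic, not something stated in the thread:

```java
public class SpeedupSketch {
    // Given a component speedup s and an observed end-to-end improvement,
    // estimate what fraction f of the original run the component took:
    //   newTime = (1 - f) + f / s   ==>   f = improvement / (1 - 1/s)
    public static double lucenePortion(double componentSpeedup,
                                       double overallImprovement) {
        return overallImprovement / (1.0 - 1.0 / componentSpeedup);
    }

    public static void main(String[] args) {
        // Ivan's figures: adding documents is 5x faster, total run ~8% faster.
        double f = lucenePortion(5.0, 0.08);
        System.out.printf("Lucene-side fraction of original run: %.0f%%%n",
                          f * 100); // prints "10%"
    }
}
```

In other words, if those numbers hold, roughly 90% of the original wall time was spent outside Lucene, which is consistent with Mike's explanation.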

Uwe Goetzke wrote:

> Hi Ivan,
> No, we do not use StandardAnalyzer or StandardTokenizer.




Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1

Ivan Vasilev
Yes, Michael, it seems our app takes time to retrieve docs from their
sources. I will have to run a profiler to see where the bottleneck is in
our case.
Thanks to you and Uwe for the answers!

Ivan

Michael McCandless wrote:

> Ivan, can you describe more about your application?
>
> The overall time for indexing has gotten much faster in 2.3, but this
> assumes that things like retrieving a document from its original
> source, filtering it, etc., take minimal time.




Re: feedback: Indexing speed improvement lucene 2.2->2.3.1

Jake Mannix
In reply to this post by Uwe Goetzke
Uwe,
  This is a little off thread-topic, but I was wondering how your
search relevance and search performance have fared with this
bigram-based index.  Is it significantly better than before you used
the NGramAnalyzer?
   -jake



On 3/24/08, Uwe Goetzke <[hidden email]> wrote:

> Hi Ivan,
> No, we do not use StandardAnalyzer or StandardTokenizer.

--
Sent from Gmail for mobile | mobile.google.com



AW: feedback: Indexing speed improvement lucene 2.2->2.3.1

Uwe Goetzke
Jake,

With the bigram-based index, we gave up the struggle to find a well-working language-based index.
We had implemented soundex (and various other "sound-alike" schemes) and hyphenation, but failed to deliver search results that users could understand ("why is this ranked higher?" and so on). One reason may be that product descriptions contain a lot of abbreviations.

The index size grew by about 30%.
Search performance seems a bit slower, but I have no concrete figures. The evaluation for one document is a bit more complex than a phrase query; one reason, of course, is that more terms are evaluated. But nevertheless it is quite good.

Search relevance improved tremendously. Missing characters, switched letters, and partial word fragments are no longer real problems (depending, of course, on the length of the search word).
The search term "weekday" also finds "day of the week", and "disabigaute" finds "disambiguate".
The algorithms I developed might not fit other domains, but for multi-language product catalogs they work quite well for us. So far...


Regards,
Uwe
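The tolerant matching Uwe describes ("disabigaute" finds "disambiguate") can be illustrated with a simple bigram-overlap score. This is only a sketch of the general idea (a Dice coefficient over character-bigram sets), not Uwe's actual scoring, which clusters bigrams inside a tolerant phrase query:

```java
import java.util.HashSet;
import java.util.Set;

public class BigramSimilarity {
    // Set of overlapping character bigrams of a token.
    static Set<String> bigrams(String s) {
        Set<String> out = new HashSet<>();
        for (int i = 0; i + 2 <= s.length(); i++) out.add(s.substring(i, i + 2));
        return out;
    }

    // Dice coefficient over bigram sets: tolerant of missing and
    // transposed characters, since most bigrams of a misspelling
    // still occur in the correct spelling.
    public static double dice(String a, String b) {
        Set<String> sa = bigrams(a), sb = bigrams(b);
        Set<String> common = new HashSet<>(sa);
        common.retainAll(sb);
        return 2.0 * common.size() / (sa.size() + sb.size());
    }

    public static void main(String[] args) {
        System.out.println(dice("disabigaute", "disambiguate")); // ~0.57: still a strong match
        System.out.println(dice("weekday", "monday"));           // lower: only "da", "ay" shared
    }
}
```

A whole-word score like this would rank the misspelled query term high against its intended target even though an exact or prefix match would miss it entirely.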

-----Ursprüngliche Nachricht-----
Von: Jake Mannix [mailto:[hidden email]]
Gesendet: Dienstag, 25. März 2008 17:13
An: [hidden email]
Betreff: Re: feedback: Indexing speed improvement lucene 2.2->2.3.1

Uwe,
  This is a little off thread-topic, but I was wondering how your
search relevance and search performance has fared with this
bigram-based index.  Is it significantly better than before you use
the NGramAnalyzer?
   -jake



On 3/24/08, Uwe Goetzke <[hidden email]> wrote:

> Hi Ivan,
> No, we do not use StandardAnalyser or StandardTokenizer.
>
> Most data is processed by
> fTextTokenStream = result = new
> org.apache.lucene.analysis.WhitespaceTokenizer(reader);
> result = new ISOLatin2AccentFilter(result); // ISOLatin1AccentFilter
> modified that ö -> oe
> result = new org.apache.lucene.analysis.LowerCaseFilter(result);
> result = new org.apache.lucene.analysis.NGramStemFilter(result,2); //just a
> bigram tokenizer
>
> We use our own queryparser. The bigramms are searched with a tolerant phrase
> query, scoring in a doc the greatest bigramms clusters covering the phrase
> token.
>
> Best Regards
>
> Uwe
>
> -----Ursprüngliche Nachricht-----
> Von: Ivan Vasilev [mailto:[hidden email]]
> Gesendet: Freitag, 21. März 2008 16:25
> An: [hidden email]
> Betreff: Re: feedback: Indexing speed improvement lucene 2.2->2.3.1
>
> Hi Uwe,
>
> Could you tell what Analyzer do you use when you marked so big indexing
> speedup?
> If you use StandardAnalyzer (that uses StandardTokenizer) may be the
> reason is in it. You can see the pre last report in the thread "Indexing
> Speed: 2.3 vs 2.2 (real world numbers)". According to the reporter Jake
> Mannix this is because now StandardTokenizer uses StandardTokenizerImpl
> that now is generated by JFlex instead of JavaCC.
> I am asking because I noticed a great speedup in adding documents to
> index in our system. We have time control on this in the debug mode. NOW
> THEY ARE ADDED 5 TIMES FASTER!!!
> But in the same time the total process of indexing in our case has
> improvement of about 8%. As our system is very big and complex I am
> wondering if really the whole process of indexing is reduces so
> remarkably and our system causes this slowdown or may be Lucene does
> some optimizations on the index, merges or something else and this is
> the reason the total process of indexing to be not so reasonably faster.
>
> Best Regards,
> Ivan
>
>
>
> Uwe Goetzke wrote:
> > This week I switched the lucene library version on one customer system.
> > The indexing speed went down from 46m32s to 16m20s for the complete task
> > including optimisation. Great Job!
> > We index product catalogs from several suppliers, in this case around
> > 56.000 product groups and 360.000 products including descriptions were
> > indexed.
> >
> > Regards
> >
> > Uwe
> >
> >
> >
> > -----------------------------------------------------------------------
> > Healy Hudson GmbH - D-55252 Mainz Kastel
> > Geschaftsfuhrer Christian Konhauser - Amtsgericht Wiesbaden HRB 12076
> >
> > Diese Email ist vertraulich. Wenn Sie nicht der beabsichtigte Empfanger
> sind, durfen Sie die Informationen nicht offen legen oder benutzen. Wenn Sie
> diese Email durch einen Fehler bekommen haben, teilen Sie uns dies bitte
> umgehend mit, indem Sie diese Email an den Absender zuruckschicken. Bitte
> loschen Sie danach diese Email.
> > This email is confidential. If you are not the intended recipient, you
> must not disclose or use this information contained in it. If you have
> received this email in error please tell us immediately by return email and
> delete the document.
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [hidden email]
> > For additional commands, e-mail: [hidden email]
> >
> >
> > __________ NOD32 2913 (20080301) Information __________
> >
> > This message was checked by NOD32 antivirus system.
> > http://www.eset.com
> >
> >
> >
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
>
>

--
Sent from Gmail for mobile | mobile.google.com

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]




Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1

Jay-98
In reply to this post by Uwe Goetzke
Hi Uwe,

I am curious what NGramStemFilter is. Is it a combination of Porter
stemming and word n-gram identification?

Thanks!

Jay

Uwe Goetzke wrote:

> Hi Ivan,
> No, we do not use StandardAnalyzer or StandardTokenizer.
>
> Most data is processed by
> fTextTokenStream = result = new org.apache.lucene.analysis.WhitespaceTokenizer(reader);
> result = new ISOLatin2AccentFilter(result); // ISOLatin1AccentFilter modified so that ö -> oe
> result = new org.apache.lucene.analysis.LowerCaseFilter(result);
> result = new org.apache.lucene.analysis.NGramStemFilter(result, 2); // just a bigram filter
>
> We use our own query parser. The bigrams are searched with a tolerant phrase query, scoring per document the largest bigram clusters covering the phrase tokens.
>
> Best Regards
>
> Uwe
>
> -----Original Message-----
> From: Ivan Vasilev [mailto:[hidden email]]
> Sent: Friday, 21 March 2008 16:25
> To: [hidden email]
> Subject: Re: feedback: Indexing speed improvement lucene 2.2->2.3.1
>
> Hi Uwe,
>
> Could you tell us which Analyzer you used when you measured such a big
> indexing speedup?
> If you use StandardAnalyzer (which uses StandardTokenizer), the reason
> may lie there. See the second-to-last report in the thread "Indexing
> Speed: 2.3 vs 2.2 (real world numbers)". According to the reporter Jake
> Mannix, this is because StandardTokenizer now uses StandardTokenizerImpl,
> which is generated by JFlex instead of JavaCC.
> I am asking because I noticed a great speedup in adding documents to
> the index in our system; we time this step in debug mode. NOW
> THEY ARE ADDED 5 TIMES FASTER!!!
> At the same time, though, the total indexing process in our case improved
> by only about 8%. As our system is very big and complex, I am wondering
> whether the whole indexing process really is reduced so remarkably and
> our own system causes the slowdown, or whether Lucene does some
> optimizations on the index, merges or something else, and that is the
> reason the total process is not so much faster.
>
> Best Regards,
> Ivan
>
>
>
> Uwe Goetzke wrote:
>> This week I switched the Lucene library version on one customer system.
>> The total indexing time went down from 46m32s to 16m20s for the complete
>> task, including optimisation. Great job!
>> We index product catalogs from several suppliers; in this case around
>> 56,000 product groups and 360,000 products, including descriptions, were
>> indexed.
>>
>> Regards
>>
>> Uwe
>>
>>
>>


Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1

Otis Gospodnetic-2
In reply to this post by Uwe Goetzke
Jay,

Have a look at Lucene config, it's all there, including tests.  This filter will take a token such as "foobar" and chop it up into n-grams (e.g. foobar -> fo oo ob ba ar would be a set of bi-grams).  You can specify the n-gram size and even min and max n-gram size.
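To make the chopping concrete, here is a minimal standalone sketch of the bigram logic described above. This is plain Java, not the actual Lucene contrib filter; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of what an n-gram filter does to a single token:
// slide a window of length n over the characters and emit each slice.
// For n = 2: "foobar" -> fo oo ob ba ar, as described above.
public class NGramSketch {
    static List<String> ngrams(String token, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= token.length(); i++) {
            grams.add(token.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("foobar", 2)); // prints [fo, oo, ob, ba, ar]
    }
}
```

The real contrib filter additionally supports min and max gram sizes and operates on a TokenStream; this only shows the core substring loop.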

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Jay <[hidden email]>
To: [hidden email]
Sent: Tuesday, March 25, 2008 1:32:24 PM
Subject: Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1

Hi Uwe,

I am curious what NGramStemFilter is. Is it a combination of Porter
stemming and word n-gram identification?

Thanks!

Jay



Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1

Jay-98
Sorry, I could not find the filter in the 2.3 API class list (core +
contrib + test). I am not aware of a Lucene config file either. Could you
please tell me where it is in the 2.3 release?

Thanks!

Jay

Otis Gospodnetic wrote:

> Jay,
>
> Have a look at Lucene config, it's all there, including tests.  This filter will take a token such as "foobar" and chop it up into n-grams (e.g. foobar -> fo oo ob ba ar would be a set of bi-grams).  You can specify the n-gram size and even min and max n-gram size.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>


Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1

Otis Gospodnetic-2
In reply to this post by Uwe Goetzke
Hi Jay,

Sorry, lapsus calami, that would be Lucene *contrib*.
Have a look:
http://lucene.apache.org/java/2_3_1/api/contrib-analyzers/index.html

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Jay <[hidden email]>
To: [hidden email]
Sent: Tuesday, March 25, 2008 6:15:54 PM
Subject: Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1

Sorry, I could not find the filter in the 2.3 API class list (core +
contrib + test). I am not aware of a Lucene config file either. Could you
please tell me where it is in the 2.3 release?

Thanks!

Jay



Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1

Jay-98
Hi Otis,
I checked that contrib before and could not find NGramStemFilter. Am I
missing another contrib?
Thanks for the link!

Jay

Otis Gospodnetic wrote:

> Hi Jay,
>
> Sorry, lapsus calami, that would be Lucene *contrib*.
> Have a look:
> http://lucene.apache.org/java/2_3_1/api/contrib-analyzers/index.html
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>

Reply | Threaded
Open this post in threaded view
|

Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1

Otis Gospodnetic-2
In reply to this post by Uwe Goetzke
Sorry, I wrote this stuff, but forgot the naming.
Look: http://lucene.apache.org/java/2_3_1/api/contrib-analyzers/org/apache/lucene/analysis/ngram/package-summary.html

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: yu <[hidden email]>
To: [hidden email]
Sent: Wednesday, March 26, 2008 12:04:33 AM
Subject: Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1

Hi Otis,
I checked that contrib before and could not find NgramStemFilter. Am I
missing some other contrib module?
Thanks for the link!

Jay

Otis Gospodnetic wrote:

> Hi Jay,
>
> Sorry, lapsus calami, that would be Lucene *contrib*.
> Have a look:
> http://lucene.apache.org/java/2_3_1/api/contrib-analyzers/index.html
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message ----
> From: Jay <[hidden email]>
> To: [hidden email]
> Sent: Tuesday, March 25, 2008 6:15:54 PM
> Subject: Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1
>
>> Sorry, I could not find the filter in the 2.3 API class list (core +
>> contrib + test). I am not aware of a Lucene config file either. Could you
>> please tell me where it is in the 2.3 release?
>
> Thanks!
>
> Jay
>
> Otis Gospodnetic wrote:
>  
>> Jay,
>>
>> Have a look at Lucene config, it's all there, including tests.  This filter will take a token such as "foobar" and chop it up into n-grams (e.g. foobar -> fo oo ob ba ar would be a set of bi-grams).  You can specify the n-gram size and even min and max n-gram size.
>>
>> Otis
>> --
>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>
>> ----- Original Message ----
>> From: Jay <[hidden email]>
>> To: [hidden email]
>> Sent: Tuesday, March 25, 2008 1:32:24 PM
>> Subject: Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1
>>
>> Hi Uwe,
>>
>> I am curious what NGramStemFilter is? Is it a combination of Porter
>> stemming and word n-gram identification?
>>
>> Thanks!
>>
>> Jay
>>
>> Uwe Goetzke wrote:
>>    
>>> Hi Ivan,
>>> No, we do not use StandardAnalyzer or StandardTokenizer.
>>>
>>> Most data is processed by
>>>     fTextTokenStream = result = new org.apache.lucene.analysis.WhitespaceTokenizer(reader);
>>>     result = new ISOLatin2AccentFilter(result); // ISOLatin1AccentFilter modified so that ö -> oe
>>>     result = new org.apache.lucene.analysis.LowerCaseFilter(result);
>>>     result = new org.apache.lucene.analysis.NGramStemFilter(result, 2); // just a bigram tokenizer
>>>
>>> We use our own query parser. The bigrams are searched with a tolerant phrase query, scoring in each doc the largest bigram clusters covering the phrase tokens.
>>>
>>> Best Regards
>>>
>>> Uwe
>>>
>>> -----Original Message-----
>>> From: Ivan Vasilev [mailto:[hidden email]]
>>> Sent: Friday, March 21, 2008 16:25
>>> To: [hidden email]
>>> Subject: Re: feedback: Indexing speed improvement lucene 2.2->2.3.1
>>>
>>> Hi Uwe,
>>>
>>> Could you tell us which Analyzer you were using when you measured such a
>>> big indexing speedup?
>>> If you use StandardAnalyzer (which uses StandardTokenizer), the reason may
>>> lie there. See the second-to-last report in the thread "Indexing
>>> Speed: 2.3 vs 2.2 (real world numbers)". According to the reporter Jake
>>> Mannix, this is because StandardTokenizer now uses StandardTokenizerImpl,
>>> which is generated by JFlex instead of JavaCC.
>>> I am asking because I noticed a great speedup in adding documents to
>>> the index in our system. We time this in debug mode. DOCUMENTS ARE NOW
>>> ADDED 5 TIMES FASTER!!!
>>> At the same time, though, the total indexing process in our case improved
>>> by only about 8%. As our system is very big and complex, I am wondering
>>> whether the whole indexing process really is reduced so remarkably and
>>> our own system causes the slowdown, or whether Lucene does some
>>> optimizations on the index, merges or something else, and that is the
>>> reason the total indexing process is not so much faster.
>>>
>>> Best Regards,
>>> Ivan
>>>
>>>
>>>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]






Reply | Threaded
Open this post in threaded view
|

Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1

Jay-98
Sorry for my ignorance; I am looking for NgramStemFilter specifically.
Are you suggesting that it's the same as NGramTokenFilter? Does it have stemming in it?

Thanks again.

Jay


Otis Gospodnetic wrote:

> Sorry, I wrote this stuff, but forgot the naming.
> Look: http://lucene.apache.org/java/2_3_1/api/contrib-analyzers/org/apache/lucene/analysis/ngram/package-summary.html
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

AW: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1

Uwe Goetzke
Hi Jay,

Sorry for the confusion. I wrote NgramStemFilter at an early stage of the project; it is essentially the same as Otis's NGramTokenFilter, with the addition that I add begin and end token markers (e.g. word becomes _word_ and thus _w wo or rd d_).

As I modified a lot of our Lucene code, which we had developed since Lucene version 1.2, to move to a 2.x version, I was not aware of the existence of NGramTokenFilter.

Stemming is in any case not useful for our problem domain (product catalogs).
We chain a WhitespaceTokenizer with a modified version of ISOLatin1AccentFilter to normalize some character-based language aspects (e.g. ß = ss, ö = oe), then lowercase the tokens before producing the bigrams.
The real advantage for us, though, is the TolerantPhraseQuery (see my other post "AW: Implement a relaxed PhraseQuery?"), which gives us a first step toward less language-dependent searching.
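A minimal, self-contained sketch of this scheme (illustrative only; the real implementation is a chain of Lucene TokenFilters, and everything below, including the class name and the small normalization table, is an assumption made for the example):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the pipeline: normalize German characters, lowercase,
// then cut boundary-marked bigrams from the token.
public class BigramSketch {

    // Rough stand-in for the modified ISOLatin1AccentFilter behaviour
    // (only a few example replacements, not the full mapping).
    static String normalize(String token) {
        return token.replace("ß", "ss")
                    .replace("ö", "oe")
                    .replace("ä", "ae")
                    .replace("ü", "ue");
    }

    // Wrap the normalized, lowercased token in '_' markers, then emit
    // every 2-gram: "Word" -> "_word_" -> [_w, wo, or, rd, d_].
    static List<String> boundaryBigrams(String token) {
        String marked = "_" + normalize(token).toLowerCase() + "_";
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 2 <= marked.length(); i++) {
            grams.add(marked.substring(i, i + 2));
        }
        return grams;
    }

    public static void main(String[] args) {
        // "Größe" -> "groesse" -> [_g, gr, ro, oe, es, ss, se, e_]
        System.out.println(boundaryBigrams("Größe"));
    }
}
```

The markers let a phrase query distinguish bigrams at word boundaries from the same pair inside a word, which is what makes the overlap scoring tolerant of inflected forms.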

Regards Uwe

-----Original Message-----
From: yu [mailto:[hidden email]]
Sent: Wednesday, March 26, 2008 05:26
To: [hidden email]
Subject: Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1

Sorry for my ignorance; I am looking for NgramStemFilter specifically.
Are you suggesting that it's the same as NGramTokenFilter? Does it have stemming in it?

Thanks again.

Jay






---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: AW: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1

Jay-98
Thanks, Uwe, for your clarification and for sharing your experience
which is very helpful!


Jay


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]