indexing api wrt Analyzer

indexing api wrt Analyzer

John Wang-9
Hi all:

    Maybe this has been asked before:

    I am building an index that contains documents in multiple languages
(the language is stored as a field), and I have different analyzers depending
on the language of the document to be indexed. But the IndexWriter takes only
a single Analyzer.

    I was hoping to have IndexWriter take an AnalyzerFactory, where the
AnalyzerFactory produces Analyzer depending on some criteria of the
document, e.g. language.

    Maybe I am going about this the wrong way.

    Any suggestions on how to go about it?

Thanks

-John

Re: indexing api wrt Analyzer

Asgeir Frimannsson-2
On Thu, Mar 13, 2008 at 10:40 AM, John Wang <[hidden email]> wrote:

>    I was hoping to have IndexWriter take an AnalyzerFactory, where the
> AnalyzerFactory produces Analyzer depending on some criteria of the
> document, e.g. language.

Perhaps this is what you are searching for:

http://lucene.apache.org/java/2_3_0/api/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html

With PerFieldAnalyzerWrapper, you can specify which analyzer to use with
each field, as well as a default analyzer.

cheers,
asgeir

Re: indexing api wrt Analyzer

Daniel Noll-3-2
On Thursday 13 March 2008 15:21:19 Asgeir Frimannsson wrote:
> >    I was hoping to have IndexWriter take an AnalyzerFactory, where the
> > AnalyzerFactory produces Analyzer depending on some criteria of the
> > document, e.g. language.

> With PerFieldAnalyzerWrapper, you can specify which analyzer to use with
> each field, as well as a default analyzer.

Certainly this would work as long as you store each language in a different
Lucene field.  This is probably a good idea anyway, as it makes things easier
for the QueryParser: a query won't necessarily contain enough text to
determine its language reliably.

Daniel
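
A minimal sketch of the field-per-language layout described above, assuming the language code is simply appended to a base field name. The `LanguageFields` helper and the `content_fr` naming scheme are illustrative assumptions, not something from the thread or from Lucene, and the type parameter `A` is a stand-in for Lucene's `Analyzer` so the snippet compiles without Lucene on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper (not part of Lucene) for one-field-per-language naming. */
final class LanguageFields {
    private LanguageFields() {}

    /** Builds a field name such as "content_fr" from a base name and a language code. */
    static String fieldFor(String baseField, String language) {
        return baseField + "_" + language.trim().toLowerCase(Locale.ROOT);
    }

    /**
     * Turns a language -> analyzer map into a field name -> analyzer map.
     * A stands in for org.apache.lucene.analysis.Analyzer.
     */
    static <A> Map<String, A> perFieldAnalyzers(String baseField, Map<String, A> byLanguage) {
        Map<String, A> byField = new LinkedHashMap<>();
        for (Map.Entry<String, A> entry : byLanguage.entrySet()) {
            byField.put(fieldFor(baseField, entry.getKey()), entry.getValue());
        }
        return byField;
    }
}
```

With Lucene available, each entry of the resulting map could be passed to `PerFieldAnalyzerWrapper.addAnalyzer(fieldName, analyzer)`, and the same `fieldFor` call could pick the field to query at search time.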

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: indexing api wrt Analyzer

Grant Ingersoll-2
In reply to this post by John Wang-9
On IndexWriter, you can pass in the Analyzer when you add a Document,
so your application can identify the language, choose the analyzer
for the given doc, and then add the document.

See
public void addDocument(Document doc, Analyzer analyzer)


--------------------------
Grant Ingersoll
http://www.lucenebootcamp.com
Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ

Re: indexing api wrt Analyzer

John Wang-9
Yes, but it's usually a good idea to add documents in batches rather than
instantiating the writer for every document and then closing it.

It would be nice if one could specify to the writer which analyzer to use.

PerFieldAnalyzerWrapper wouldn't work because different analyzers may apply
to the same field depending on the doc, e.g.

if (field1.name.equals("fr"))
    use FrenchAnalyzer on content field
etc.

-John


Re: indexing api wrt Analyzer

Grant Ingersoll-2

On Mar 13, 2008, at 11:03 AM, John Wang wrote:

> Yes, but usually it's a good idea to add documents in batch and not having
> to reinstantiate the writer for every document and then closing it.
>

Why does what I suggested require instantiating a new writer for every
document?  One writer is kept open for the whole batch, and it uses the
analyzer you pass in with the method:

IndexWriter writer = new IndexWriter(dir, defaultAnalyzer, ...);

for each doc to add:
    Document doc = ...;
    Analyzer analyzer = getAnalyzer(language);  // chosen per document
    writer.addDocument(doc, analyzer);          // overrides defaultAnalyzer for this doc only

writer.close();

-Grant
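
The `getAnalyzer(language)` helper in the pseudocode above is left undefined in the thread. One possible shape for it, sketched under the assumption that a plain map from language code to analyzer with a default fallback is enough: `AnalyzerRegistry` is a hypothetical name, and the type parameter `A` stands in for Lucene's `Analyzer` so that the snippet compiles without Lucene on the classpath.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Objects;

/**
 * Hypothetical helper (not part of Lucene): maps a language code to an analyzer.
 * A is a stand-in for org.apache.lucene.analysis.Analyzer.
 */
final class AnalyzerRegistry<A> {
    private final Map<String, A> byLanguage = new HashMap<>();
    private final A fallback;

    /** The fallback is returned for unknown or missing language codes. */
    AnalyzerRegistry(A fallback) {
        this.fallback = Objects.requireNonNull(fallback, "fallback");
    }

    /** Registers the analyzer for a language code such as "fr". */
    AnalyzerRegistry<A> register(String language, A analyzer) {
        byLanguage.put(normalize(language), Objects.requireNonNull(analyzer, "analyzer"));
        return this;
    }

    /** Returns the analyzer for the language, or the fallback if none is registered. */
    A getAnalyzer(String language) {
        if (language == null) {
            return fallback;
        }
        return byLanguage.getOrDefault(normalize(language), fallback);
    }

    private static String normalize(String language) {
        return language.trim().toLowerCase(Locale.ROOT);
    }
}
```

With Lucene available, `A` would be `Analyzer`, and the loop body becomes `writer.addDocument(doc, registry.getAnalyzer(language))` on a single long-lived writer. The registry is built once, so no analyzer or writer is created per document.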



Re: indexing api wrt Analyzer

Grant Ingersoll-2
In reply to this post by John Wang-9

On Mar 13, 2008, at 11:03 AM, John Wang wrote:

> Yes, but usually it's a good idea to add documents in batch and not having
> to reinstantiate the writer for every document and then closing it.
>
> It would be nice if one can specify to the writer which analyzer to use.
>
> PerfieldAnalyzer wouldn't work because different analyzers may apply on the
> same field depending on the doc, e.g.
>

Also, I don't know that it is wise to put different langs in the same  
field.  I can't prove it definitively, but it seems to me your corpus  
statistics could be skewed by terms that are spelled the same but have  
different meanings across languages.



Re: indexing api wrt Analyzer

John Wang-9
Hi Grant:

    For our corpus, we don't rely much on IDF in the scoring calculation,
so I don't see that being much of a problem.

    About performance: compare instantiating one IndexWriter for a batch of,
say, 1000 docs (iterating over the 1000 docs and calling addDocument on each)
with instantiating and closing 1000 IndexWriters, each doing one addDocument.
Are you saying the expected performance is the same? I thought that when you
call addDocument, the document is added to memory and flushed when a segment
needs to be merged or the writer closes.

    Maybe I am missing something.

Thanks

-john


Re: indexing api wrt Analyzer

Grant Ingersoll-2
There is an addDocument method that takes an Analyzer and overrides  
the one used at construction of the IndexWriter.  See
http://lucene.apache.org/java/2_3_1/api/core/org/apache/lucene/index/IndexWriter.html#addDocument(org.apache.lucene.document.Document,%20org.apache.lucene.analysis.Analyzer)

Re: indexing api wrt Analyzer

John Wang-9
Excellent!
Exactly what I was looking for!

Thanks Grant!

-John
