Indexing accented characters, then searching by any form

Cesar Ronchese
Hello, guys.

I've been searching Google for a way to make Lucene perform accent-insensitive searches.

All I could find is the ISOLatin1AccentFilter class, which, as far as I understand, just removes the accents so that the text is stored in its unaccented form.

What I would like to know is: is there a way to store the content in its original accented form and still run an accent-insensitive query with Lucene? How?

For example:
Indexed word: usuário
Terms typed by the user, to find the word above: usuário or usuario or usuãrio, etc.

Thanks in advance.
Cesar

Re: Indexing accented characters, then searching by any form

Karl Wettin

On 11 Feb 2008, at 16:00, Cesar Ronchese wrote:

>
> Hello, guys.
>
> I've been searching Google for a way to make Lucene perform
> accent-insensitive searches.
>
> All I could find is the ISOLatin1AccentFilter class, which, as far as I
> understand, just removes the accents so that the text is stored in its
> unaccented form.

What is the problem you have with this? Are they not unique enough?

> What I would like to know is: is there a way to store the content in its
> original accented form and still run an accent-insensitive query with
> Lucene? How?

You would have to add a synonym for each permutation of the accented  
term.
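
To get a feel for what "a synonym for each permutation" means in practice, here is a minimal, Lucene-free sketch in plain Java. The class name AccentVariantsSketch and its small FORMS table are made up for illustration (the table is deliberately incomplete, and the string literals use Unicode escapes so the source stays ASCII-only). It only enumerates the spellings that would each have to be registered as a synonym; it does not register them anywhere.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of Lucene: enumerates accent variants of a word. */
final class AccentVariantsSketch {
    // Illustrative, incomplete table: base letter -> forms a user might type.
    private static final Map<Character, String> FORMS = new LinkedHashMap<>();
    static {
        // a, a-acute, a-grave, a-circumflex, a-tilde
        FORMS.put('a', "a\u00e1\u00e0\u00e2\u00e3");
        // o, o-acute, o-circumflex, o-tilde
        FORMS.put('o', "o\u00f3\u00f4\u00f5");
    }

    /** Expands an unaccented word into every combination of the forms above. */
    static List<String> variants(String unaccented) {
        List<String> result = new ArrayList<>();
        result.add("");
        for (char c : unaccented.toCharArray()) {
            // Letters missing from the table have exactly one form: themselves.
            String forms = FORMS.getOrDefault(c, String.valueOf(c));
            List<String> next = new ArrayList<>(result.size() * forms.length());
            for (String prefix : result) {
                for (char form : forms.toCharArray()) {
                    next.add(prefix + form);
                }
            }
            result = next;
        }
        return result;
    }

    public static void main(String[] args) {
        // 5 forms of 'a' times 4 forms of 'o' = 20 spellings of one word.
        System.out.println(variants("usuario").size());
    }
}
```

Even with only two letters covered, one seven-letter word already expands to 20 forms, and every additional accented letter multiplies that number. That growth is the practical argument for stripping accents on both the index and the query side instead.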



    karl

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Indexing accented characters, then searching by any form

Karl Wettin

On 11 Feb 2008, at 16:08, Karl Wettin wrote:

>> All I could find is the ISOLatin1AccentFilter class, which, as far as I
>> understand, just removes the accents so that the text is stored in its
>> unaccented form.
>
> What is the problem you have with this? Are they not unique enough?


One more thing,

are you aware that you are supposed to apply that filter to the query too?


   karl


Re: Indexing accented characters, then searching by any form

Erick Erickson
In reply to this post by Cesar Ronchese
I'm inferring that you need the original text for display purposes or some
such, but want to search a "canonical" form. So the following may be totally
irrelevant if my inference is wrong.....

Indexed and stored are two very distinct things in Lucene. If you create
a field that is both stored and indexed, the indexed part goes through
the analyzer and the stored part does not. So you get the best of both
worlds: your search goes against the analyzed text, but if you fetch
the field, it's in the original format.

I didn't understand this until after creating the first product using
Lucene, so one of our production applications has some fields stored but
not indexed and the *same* data indexed but not stored.. Siiiiggghhh.

Try this code to gain comfort. It uses casing as a stand-in for accents,
but you can easily adapt it to try your accented cases.

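    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class StoredVsIndexed {
        public static void main(String[] args) throws Exception {
            try {
                // Index one document; the field is both stored and tokenized.
                IndexWriter iw = new IndexWriter("C:/test", new StandardAnalyzer());
                Document doc = new Document();
                doc.add(new Field("blivet", "This is some Mixed Case Text",
                        Field.Store.YES, Field.Index.TOKENIZED));
                iw.addDocument(doc);
                iw.close();

                IndexSearcher search = new IndexSearcher("C:/test");
                QueryParser qp = new QueryParser("blivet", new StandardAnalyzer());
                // Only matches if StandardAnalyzer lower-cased the input.
                Query q = qp.parse("mixed");
                Hits hits = search.search(q);
                System.out.println("Count = " + hits.length());
                // Outputs the mixed-case stored field.
                System.out.println(search.getIndexReader().document(0).get("blivet"));
            } catch (Exception e) {
                System.err.println("Caught Exception");
                System.err.flush();
                e.printStackTrace();
            }
        }
    }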
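
The same stored-versus-indexed split can be sketched for accents without Lucene at all. The following is a toy in-memory stand-in, not Lucene code: FoldedIndexSketch and its methods are hypothetical names, and java.text.Normalizer plays the role that an accent-stripping filter would play inside the analyzer. String literals use Unicode escapes so the source stays ASCII-only.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory stand-in for a stored + indexed field; not Lucene code. */
final class FoldedIndexSketch {
    // "Stored" side: the original text, never modified.
    private final List<String> stored = new ArrayList<>();
    // "Indexed" side: folded term -> ids of the documents containing it.
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Decomposes accented letters, drops the combining marks, then lower-cases. */
    static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "").toLowerCase(Locale.ROOT);
    }

    /** Stores the text as-is and indexes the folded form of each word. */
    int add(String text) {
        int docId = stored.size();
        stored.add(text);
        for (String token : text.split("[^\\p{L}\\p{M}\\p{N}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(fold(token), k -> new LinkedHashSet<>()).add(docId);
            }
        }
        return docId;
    }

    /** Folds the query term the same way, then returns the ORIGINAL stored texts. */
    List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        for (int docId : postings.getOrDefault(fold(term), Collections.<Integer>emptySet())) {
            hits.add(stored.get(docId));
        }
        return hits;
    }

    public static void main(String[] args) {
        FoldedIndexSketch index = new FoldedIndexSketch();
        index.add("Nome do usu\u00e1rio");               // contains a-acute
        System.out.println(index.search("usuario"));     // unaccented query
        System.out.println(index.search("usu\u00e3rio")); // a-tilde query
    }
}
```

Both queries in main find the document, and what comes back is the accented original, because only the lookup key is folded.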


Best
Erick

On Feb 11, 2008 10:00 AM, Cesar Ronchese <[hidden email]> wrote:

> --
> View this message in context:
> http://www.nabble.com/Indexing-accented-characters%2C-then-searching-by-any-form-tp15412778p15412778.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.

Re: Indexing accented characters, then searching by any form

Cesar Ronchese
In reply to this post by Karl Wettin
Hey Karl. Thanks for the response. I have a few more questions:

1) About the ISOLatin1AccentFilter class:
> What is the problem you have with this? Are they not unique enough?

I need to store the words the way they were written. So, if the text to be indexed contains the word "usuário", my user expects to see the word in its correct form, "usuário", in the search results, not "usuario".

I could be wrong, but this is what I think ISOLatin1AccentFilter does: it indexes the content with the accents removed, so every later result comes back without accents as well. In short, I lose the accents forever.


2) About the synonym and permutation:
> You would have to add a synonym for each permutation of the accented term.

Do you have a code sample for this? Or even a link to a how-to?

Re: Indexing accented characters, then searching by any form

Cesar Ronchese
In reply to this post by Erick Erickson
Hey, Erick. You inferred right.

I analyzed your code and it looks like ordinary indexing and searching code. Are you sure you pasted the correct code? :P

Anyway, is the idea to store the data twice, one copy with accents and one without? If so, I did that earlier, but once I search in the unaccented content and show the accented content, the HitHighlighter will now work properly.

Re: Indexing accented characters, then searching by any form

Cesar Ronchese
In reply to this post by Karl Wettin
> One more thing,
> are you aware of that you are supposed to apply that filter on the  
> query too?

I don't know how to apply that filter to a Query object. I've searched to see whether it is possible, but I can't find any references. If it is possible, do you have a quick example?

I'm searching this way:

IndexReader objIndexReader = GetIndexReader(); //GetIndexReader()  is my function to retrieve an IndexReader
IndexSearcher objSearcher = new IndexSearcher(objIndexReader);
StandardAnalyzer objAnalyzer = new StandardAnalyzer();
MultiFieldQueryParser objParser = new MultiFieldQueryParser(GetFields(), objAnalyzer);
Query objQuery = objParser.parse(SearchText);
objQuery = objQuery.rewrite(objIndexReader);
Hits objHits = objSearcher.search(objQuery);

Re: Indexing accented characters, then searching by any form

Erick Erickson
In reply to this post by Cesar Ronchese
See below...


On Feb 11, 2008 12:17 PM, Cesar Ronchese <[hidden email]> wrote:

>
> Hey, Erick. You inferred right.
>
> I analyzed your code and it looks like ordinary indexing and searching
> code. Are you sure you pasted the correct code? :P
>

Did you try to run it? It's just a self-contained example showing that
searching and displaying are distinct.

The indexer part indexes a mixed-case string. The search is then
performed on a lower-case string, and the println shows that a
document was found. The next println echoes back the stored text
showing that the original was stored. Just substitute your preferred
filter to see how this would work for you.

>
> Anyway, is the idea to store the data twice, one copy with accents and one
> without? If so, I did that earlier, but once I search in the unaccented
> content and show the accented content, the HitHighlighter will now work
> properly.
>

Is this a typo or is your problem solved? I confess that I haven't needed
to use the highlighter package yet, so I may be missing something...

But you're not really "double storing". You'll find that indexed data takes
MUCH less space than you would think, nowhere near the amount
required to store the data too. So there's good reason to separate the two.

You have no choice except to store the data if you want the user to see
something pretty.....

Erick



Re: Indexing accented characters, then searching by any form

Karl Wettin
In reply to this post by Cesar Ronchese

On 11 Feb 2008, at 18:16, Cesar Ronchese wrote:

> I don't know how to set that filter to Query object.

It is a TokenStream you filter, not the Query. In your case the  
TokenStream is produced by the QueryParser invoking  
analyzer.tokenStream(field, new StringReader(input)). So what you have  
to do is to replace the analyzer with your own implementation.

>
> I'm searching this way:

> StandardAnalyzer objAnalyzer = new StandardAnalyzer();

Try this (dry coded) snippet instead:

StandardAnalyzer objAnalyzer = new StandardAnalyzer() {
   public TokenStream tokenStream(String fieldName, Reader reader) {
     // Strip accents from every token the standard analysis chain produces.
     return new ISOLatin1AccentFilter(super.tokenStream(fieldName, reader));
   }
};

Looking at the code you also probably want to reuse your Analyzer,
IndexSearcher and IndexReader and not create a new instance each time
you place a query.  You can read more about that and many other things
here: <http://wiki.apache.org/lucene-java/BasicsOfPerformance>


    karl


Re: Indexing accented characters, then searching by any form

Cesar Ronchese
Woot, Karl.
It worked like a charm! It even worked with the Highlighter. THANKS!



Re: Indexing accented characters, then searching by any form

Cesar Ronchese
Oops!
I found a problem here, Karl:
If the content is stored without accents, everything is OK.
But my content is stored with accents, and I noticed that the ISOLatin1AccentFilter only removes the accents from the search terms, so the accented documents are not returned in my Hits collection.

Any idea how to fix it?

Re: Indexing accented characters, then searching by any form

Petite Abeille-2-2
In reply to this post by Cesar Ronchese

On Feb 11, 2008, at 4:00 PM, Cesar Ronchese wrote:

> For example:
> Indexed word: usuário
> Terms typed by the user, to find the word above: usuário or usuario or
> usuãrio, etc.

If you feel ambitious, you can try something along the lines of Sean  
M. Burke's Unidecode!:

http://interglacial.com/~sburke/tpj/as_html/tpj22.html


E.g. searching for 'aaiun':

http://svr225.stepx.com:3388/search?q=Aaiún


Returns 'El Aaiún':

http://svr225.stepx.com:3388/el-aaiun


Or perhaps 'eire':

http://svr225.stepx.com:3388/search?q=eur


Returns 'Éire':

http://svr225.stepx.com:3388/eur2-commemorative-coins


Cheers,

PA.

Re: Indexing accented characters, then searching by any form

Cesar Ronchese
In reply to this post by Erick Erickson
Well, it is done now.

In the end, I surrendered to "double-storing". I have stored the original text with the COMPRESSED option to save some space.

And to highlight the results correctly, I matched the unaccented words against the original words using regular expressions, and the results are satisfactory.

Thanks all for the brainstorming ^^
Cesar
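
One way such a matching step might look is sketched below. This is an assumption about the general approach, not Cesar's actual code, and it does not use Lucene's highlighter: AccentHighlightSketch, fold and highlight are hypothetical names. A regular expression walks the words of the original (accented) text, each word is folded to its unaccented form for comparison, and the markup is wrapped around the original spelling.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical highlighter sketch; plain JDK, not the Lucene highlighter package. */
final class AccentHighlightSketch {
    // A "word" is a run of letters plus any combining marks attached to them.
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{M}]+");

    /** Decomposes accented letters, drops the combining marks, then lower-cases. */
    static String fold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "")
                .toLowerCase(Locale.ROOT);
    }

    /** Wraps every word of the original text whose folded form equals the folded query term. */
    static String highlight(String original, String queryTerm) {
        String wanted = fold(queryTerm);
        Matcher m = WORD.matcher(original);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            if (fold(m.group()).equals(wanted)) {
                // Copy the untouched text before the word, then the word in its original spelling.
                out.append(original, last, m.start())
                   .append("<b>").append(m.group()).append("</b>");
                last = m.end();
            }
        }
        return out.append(original.substring(last)).toString();
    }

    public static void main(String[] args) {
        // "usu\u00e1rio" is the accented word from the thread, written with an escape.
        System.out.println(highlight("O usu\u00e1rio entrou", "usuario"));
    }
}
```

Because the comparison happens on folded copies while the output is built from the original string, the accents survive in the highlighted result.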





Re: Indexing accented characters, then searching by any form

Dora
In reply to this post by Karl Wettin

Karl Wettin wrote
> Try this (dry coded) snippet instead:
>
> StandardAnalyzer objAnalyzer = new StandardAnalyzer() {
>    public TokenStream tokenStream(String fieldName, Reader reader) {
>      return new ISOLatin1AccentFilter(super.tokenStream(fieldName, reader));
>    }
> }

I tried this, but it does not work as expected.

I am using a utility class with a static method that gives me an analyzer:

    public static Analyzer getAnalyzer() {
        return new StandardAnalyzer() {
            // Wrap the standard token stream so accents are stripped.
            public TokenStream tokenStream(String fieldName, Reader reader) {
                return new ISOLatin1AccentFilter(super.tokenStream(fieldName, reader));
            }
        };
    }

So whenever I need the analyzer (for indexing or searching) I call UtilityClass.getAnalyzer().

It works for my query parser: the accents are correctly removed when performing the search.
If my index contains "cafe", searching for "café" will find the documents containing "cafe".

But when I explore my index with Luke, I can see that the indexer does not use the ISOLatin1AccentFilter (I tested with a breakpoint in the overridden tokenStream method), and if the document contains "café", the index will contain "café".

As a consequence, searching for a word that has an accent is not possible: the index contains the accent, while the search process removes it.

So my index contains "café", but when I search for "café" the filter changes it to "cafe" and I get no hits...

Any clue why my filter is not used at indexing time?




RE: Indexing accented characters, then searching by any form

diego.cassinera
Are you sure you are creating the fields with Field.Index.ANALYZED ?

-----Original Message-----
From: Dora [mailto:[hidden email]]
Sent: Tuesday, 25 November 2008 12:22 p.m.
To: [hidden email]
Subject: Re: Indexing accented characters, then searching by any form





RE: Indexing accented characters, then searching by any form

Dora

Diego Cassinera wrote
> Are you sure you are creating the fields with Field.Index.ANALYZED ?

Yes, my fields are all ANALYZED. (One was ANALYZED_NO_NORMS, but changing it to ANALYZED did not solve the problem.)

I checked with the debugger, and the analyzer I use to update my index does contain my ISOLatin1AccentFilter.

It looks like the IndexWriter does not go through the tokenStream method.
Maybe this is because I call updateDocument() instead of addDocument()?

Here is how I index a document:
m_analyzer is the Analyzer returned by my getAnalyzer method;
field and fieldValue form the "key" to my document (a unique ID).

IndexWriter luceneIndexWriter = new IndexWriter(m_indexDir, m_analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
luceneIndexWriter.updateDocument(new Term(field, fieldValue), luceneDocument);

RE: Indexing accented characters, then searching by any form

Dora
It seems that indexing and searching do not go through the analyzer in the same way:

The "tokenStream" method is called at search time, while at indexing time "reusableTokenStream" is called.

Overriding reusableTokenStream (as I did for tokenStream) fixed the problem.
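
The pitfall described here can be reproduced with a toy model in plain Java. BaseAnalyzerSketch below is a hypothetical stand-in, not Lucene's Analyzer: its two methods merely mirror the names of the two entry points mentioned in this thread, and each returns one normalized string instead of a token stream. The sketch only shows why a customization applied to one entry point and not the other makes index terms and query terms disagree.

```java
import java.text.Normalizer;
import java.util.Locale;

/** Demonstrates the index/query mismatch with a hypothetical two-entry-point analyzer. */
final class TwoEntryPointsSketch {
    static String stripAccents(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
    }

    /** Customizes only the search-time entry point: reproduces the bug. */
    static BaseAnalyzerSketch halfOverridden() {
        return new BaseAnalyzerSketch() {
            @Override String tokenStream(String text) {
                return stripAccents(super.tokenStream(text));
            }
        };
    }

    /** Customizes both entry points: index and query terms agree again. */
    static BaseAnalyzerSketch fullyOverridden() {
        return new BaseAnalyzerSketch() {
            @Override String tokenStream(String text) {
                return stripAccents(super.tokenStream(text));
            }
            @Override String reusableTokenStream(String text) {
                return stripAccents(super.reusableTokenStream(text));
            }
        };
    }

    /** True when the term produced at index time equals the term produced at query time. */
    static boolean matches(BaseAnalyzerSketch analyzer, String indexedText, String queryText) {
        return analyzer.reusableTokenStream(indexedText).equals(analyzer.tokenStream(queryText));
    }

    public static void main(String[] args) {
        String cafe = "caf\u00e9"; // "cafe" with e-acute
        System.out.println(matches(halfOverridden(), cafe, cafe));  // prints false
        System.out.println(matches(fullyOverridden(), cafe, cafe)); // prints true
    }
}

/** Hypothetical base class; NOT Lucene's Analyzer. */
class BaseAnalyzerSketch {
    /** Entry point the thread reports being called at search time. */
    String tokenStream(String text) {
        return text.toLowerCase(Locale.ROOT);
    }

    /** Entry point the thread reports being called at indexing time. */
    String reusableTokenStream(String text) {
        return text.toLowerCase(Locale.ROOT);
    }
}
```

With only one entry point customized, the accented word is indexed with its accent but queried without it, so even an exact query for the accented word finds nothing, which matches the symptom reported above.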