issues with wildcard search and snowball english analyzer

JBTech
I am using SnowballAnalyzer (English).
I just created one document with one field with content as "elephant is a big animal".
I searched for e*t using queryparser.
This did not return any results.
I indexed with "lion is a big animal".
Then searched for l*t. This returned one result as expected.
I looked at the index using Luke and found that the analyzer had stemmed "elephant" to "eleph".
I reindexed "elephant is a big animal" and tried with e*p, this time I got one hit.
I like stemming because it reduces "tests", "tested", "testing", etc. to "test".
Is there a way to avoid stemming in certain cases?
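
The behaviour described above can be reproduced without Lucene. Wildcard terms are not passed through the analyzer by the query parser, so the pattern is compared against the terms exactly as they are stored in the index. The following is a minimal sketch in plain Java, not Lucene's API: the class and method names are hypothetical, and the term lists are hand-written, with only "eleph" taken from the Luke observation above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class WildcardVsStem {

    // Translate a simple wildcard pattern (* = any run of characters,
    // ? = exactly one character) into an equivalent regular expression.
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    // Return the indexed terms that the wildcard matches in full.
    static List<String> matchingTerms(String wildcard, List<String> indexedTerms) {
        Pattern pattern = toRegex(wildcard);
        List<String> hits = new ArrayList<>();
        for (String term : indexedTerms) {
            if (pattern.matcher(term).matches()) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Assumed index contents: "eleph" is the stored form reported in the thread.
        List<String> stemmedIndex = Arrays.asList("eleph", "big");
        // Assumed contents of an index built without stemming.
        List<String> verbatimIndex = Arrays.asList("elephant", "big");

        System.out.println(matchingTerms("e*t", stemmedIndex));    // []
        System.out.println(matchingTerms("e*t", verbatimIndex));   // [elephant]
        System.out.println(matchingTerms("eleph*", stemmedIndex)); // [eleph]
    }
}
```

The pattern e*t can only match a stored term that ends in "t", which the stemmed form no longer does.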
Re: issues with wildcard search and snowball english analyzer

Andrew Gilmartin-2
--- On Thu, 7/24/08, JBTech <[hidden email]> wrote:

> Is there a way to avoid stemming in certain cases?

As a general rule, make the query intelligent and not the index. Therefore, index your text verbatim. Small changes like changing terms to lowercase and removing possessives are fine. You now have an index upon which you can make intelligent queries.

An intelligent query requires keeping track of several collections of term-to-term(s) mappings, for example stemmed-term to verbatim-term(s). Now, convert the user's search for "elephant is a big animal" into something akin to

( (elephant^10) OR (A) OR (B) ) AND
( (big^10) OR (C) ) AND
( (animal^10) OR (D) )

Where A and B are other terms with the same stemming as elephant, C is another term with the same stemming as big, and D is another term with the same stemming as animal. Adding the boost ensures that a verbatim match pushes the document's rank higher, so that what the user asked for ends up closer to the top.
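
A minimal sketch of that expansion in plain Java, building only the query string and calling no Lucene API. The class and method names are hypothetical, and the variant map is hand-written for illustration rather than produced by a stemmer; in practice it would be derived from the stemmed-term to verbatim-term(s) mapping.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ExpandedQuery {

    // Build one clause: the verbatim term carries the boost,
    // and each other term sharing its stem is OR'd in unboosted.
    static String clause(String term, List<String> variants, int boost) {
        StringBuilder sb = new StringBuilder("( (" + term + "^" + boost + ")");
        for (String variant : variants) {
            if (!variant.equals(term)) {
                sb.append(" OR (").append(variant).append(")");
            }
        }
        return sb.append(" )").toString();
    }

    // AND together one clause per term the user typed.
    static String expand(List<String> userTerms,
                         Map<String, List<String>> variantsByTerm,
                         int boost) {
        List<String> clauses = new ArrayList<>();
        for (String term : userTerms) {
            List<String> variants =
                variantsByTerm.getOrDefault(term, Collections.<String>emptyList());
            clauses.add(clause(term, variants, boost));
        }
        return String.join(" AND ", clauses);
    }

    public static void main(String[] args) {
        // Assumed, hand-written variants; "big" has none in this example.
        Map<String, List<String>> variants = new LinkedHashMap<>();
        variants.put("elephant", Arrays.asList("elephants"));
        variants.put("animal", Arrays.asList("animals"));

        String query = expand(Arrays.asList("elephant", "big", "animal"), variants, 10);
        System.out.println(query);
        // ( (elephant^10) OR (elephants) ) AND ( (big^10) ) AND ( (animal^10) OR (animals) )
    }
}
```

The resulting string could then be handed to a query parser that is configured not to stem, so the verbatim terms are matched as typed.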

This basic idea of making the queries more intelligent by broadening them and boosting term weights gives you a lot of control over the query and how results are ranked. The same control is not possible by making the index more intelligent.

Don't worry about Lucene's performance with complex queries. My experience is that it is very fast.

And to answer your specific question, a search for "e*t" will work as is.

-- Andrew



Re: issues with wildcard search and snowball english analyzer

JBTech
Hi Andrew,
Thanks for your quick reply.
I tried with e*t and that did not return any results.
I am using Lucene 2.2.
The full word "elephant" returned one hit, as I am using the same analyzer for indexing and searching.
I uploaded the Java class I used for testing this.
Thanks
JB
Testing.java
Re: issues with wildcard search and snowball english analyzer

Andrew Gilmartin-2
--- On Fri, 7/25/08, JBTech <[hidden email]> wrote:

> I tried with e*t and that did not return any results.

Hum. Example code would be helpful now.

-- Andrew