Quotes dependent StopWords removal

Sameer Maggon
Currently, in my application (which uses Lucene), I am using a Porter + StandardAnalyzer (with stop words).

I would like to do the following:

When the user performs a search, the analyzer should remove a stop word only if it does not appear inside quotes. If the stop word appears inside quotes, I don't want the analyzer to remove it.

For example:

"no dress code" - should not remove "no", as it is inside quotes.

shirts with trousers - should remove "with" as a stop word.

I have been trying to do this with Lucene, but have not found a straightforward way of doing it. I have been digging through the Lucene mail archives, and it seems there is no easy way to do this apart from extending or modifying the QueryParser. In some sense, it is similar to the issue discussed in:

http://www.gossamer-threads.com/lists/lucene/java-user/38946

Is there any way I can avoid subclassing QueryParser?

Thanks,

Sameer Maggon.

Re: Quotes dependent StopWords removal

Mark Miller-3
If you do not put the stop words in the index during analysis, what use
will they be at search time?

- Mark

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Quotes dependent StopWords removal

Sameer Maggon
I won't remove the stop words while indexing.

Sameer.

-----Original Message-----
From Mark Miller <[hidden email]>
Sent Tue 8/15/2006 3:36 PM
To [hidden email]
Subject Re: Quotes dependent StopWords removal

If you do not put the stop words in the index during analysis what use
will they be at search time?

- Mark



Re: Quotes dependent StopWords removal

Mark Miller-3
This appears tricky to me. I may be completely wrong, but I would start
by looking at the StandardAnalyzer. I would try to create a new token
that matches an opening quotation mark. I would then change the next()
method in StandardTokenizer.jj to record when it recognizes an opening
quotation mark. Now you are in a quote. Somehow mark each token (there
might not be an obvious way to do this) until you find the closing
quotation mark. Now record that you are no longer in a quote. When not
in a quote, do not mark the tokens coming out of next(). Then, in the
StopFilter, check each token for your marker and do not remove the token
if it is marked.
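
As a rough illustration of the marking idea, here is a minimal sketch in plain Java. It does not use any Lucene classes: QuoteMarkingSketch, MarkedToken, tokenize and filter are hypothetical stand-ins for a tokenizer and a stop filter, and the sketch works on the raw query string rather than on what QueryParser hands to an analyzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy model: mark tokens that sit inside quotes, then let the stop filter skip marked tokens. */
final class QuoteMarkingSketch {

    /** Hypothetical token type: the term text plus an "inside quotes" marker. */
    static final class MarkedToken {
        final String text;
        final boolean quoted;

        MarkedToken(String text, boolean quoted) {
            this.text = text;
            this.quoted = quoted;
        }
    }

    /** Splits on whitespace and double quotes, flipping the marker at each quote character. */
    static List<MarkedToken> tokenize(String input) {
        List<MarkedToken> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuote = false;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '"' || Character.isWhitespace(c)) {
                if (current.length() > 0) {
                    // The token ends here; it carries the quote state it was read in.
                    tokens.add(new MarkedToken(current.toString().toLowerCase(Locale.ROOT), inQuote));
                    current.setLength(0);
                }
                if (c == '"') {
                    inQuote = !inQuote;
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            tokens.add(new MarkedToken(current.toString().toLowerCase(Locale.ROOT), inQuote));
        }
        return tokens;
    }

    /** Stop filter stand-in: drops a stop word only when the token is not marked as quoted. */
    static List<String> filter(List<MarkedToken> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (MarkedToken t : tokens) {
            if (t.quoted || !stopWords.contains(t.text)) {
                kept.add(t.text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Illustrative stop word list, not Lucene's default set.
        Set<String> stop = new HashSet<>(Arrays.asList("no", "with"));
        System.out.println(filter(tokenize("\"no dress code\""), stop));     // [no, dress, code]
        System.out.println(filter(tokenize("shirts with trousers"), stop)); // [shirts, trousers]
    }
}
```

The marker travels with each token, so the filter needs no knowledge of the original query string.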

Bear in mind... this may all be worthless speculation.

- Mark

Re: Quotes dependent StopWords removal

Mark Miller-3
In reply to this post by Sameer Maggon
My last answer was terrible. QueryParser will not send any quotation
marks into the analyzer. How about this:

Below are roughly lines 965-992 of QueryParser.java. Change the call
getFieldQuery(field, term.image.substring(1, term.image.length()-1), s)
(line 992) to call a function identical to getFieldQuery, except that
it uses an analyzer that does not remove stop words. Case QUOTED occurs
when a QUOTED token is consumed. getFieldQuery puts that token (or
tokens, possibly at the same position) through an analyzer and returns
a Query object. You want the analyzer used there not to strip stop
words when the token type is QUOTED. Sounds reasonable to me.

Replacing the entire method just to change the analyzer is very brute
force, but maybe it will spark an idea for something more elegant. The
same "bear in mind this might be BS" applies to this answer.

- Mark

line 965 of QueryParser.java
    case QUOTED:
      term = jj_consume_token(QUOTED);
      switch ((jj_ntk==-1)?jj_ntk():jj_ntk) {
      case FUZZY_SLOP:
        fuzzySlop = jj_consume_token(FUZZY_SLOP);
        break;
      default:
        jj_la1[19] = jj_gen;
        ;
      }
      switch ((jj_ntk==-1)?jj_ntk():jj_ntk) {
      case CARAT:
        jj_consume_token(CARAT);
        boost = jj_consume_token(NUMBER);
        break;
      default:
        jj_la1[20] = jj_gen;
        ;
      }
      int s = phraseSlop;

      if (fuzzySlop != null) {
        try {
          s = Float.valueOf(fuzzySlop.image.substring(1)).intValue();
        }
        catch (Exception ignored) { }
      }
      q = getFieldQuery(field, term.image.substring(1, term.image.length()-1), s);

Re: Quotes dependent StopWords removal

Mark Miller-3
In reply to this post by Sameer Maggon
Allow me to amend my last email: you should make those changes not in
QueryParser.java but in QueryParser.jj (line 866). Also, I know you did
not want to subclass QueryParser, but the analyzer cannot know that it
is inside quotes unless you hook into QueryParser.

- Mark

Re: Quotes dependent StopWords removal

duiduder
In reply to this post by Sameer Maggon
Hello Sameer,

what about this:

- during indexing, use the StandardAnalyzer without stopwords
- during the search, use 2 different Analyzers - one with and one without stopwords. To choose between them, first check whether
  the user has typed quotes in her query string.
  # If so, check whether there are stopwords between the quotes
    * if there is a stopword between the quotes, use the Analyzer without stopwords
    * if there is no stopword between the quotes, use the one with stopwords
  # If not, use the one with stopwords anyway

The weakness of this approach is that when a user mixes stopwords inside and outside quotes in one query, you cannot decide so easily.
One solution might be to modify the analyzer's stopword list on the fly. The last problem left is then when the user types
a specific stopword twice, once with and once without quotes. In that situation maybe you can live with using the Analyzer without stopwords;
depending on your scenario, it could be a good compromise. Or search n times, but that wouldn't be straightforward either ;)
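
The decision rule above could be sketched roughly as follows. This is plain Java with no Lucene dependency; AnalyzerChoiceSketch, Choice, hasQuotedStopWord and choose are hypothetical names, and the result is only a label telling the caller which of its two analyzers to hand to the query parser.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Toy model: pick one analyzer for the whole query, based on stop words found inside quotes. */
final class AnalyzerChoiceSketch {

    /** Hypothetical labels for the two analyzers the caller keeps around. */
    enum Choice { REMOVES_STOP_WORDS, KEEPS_STOP_WORDS }

    /** Returns true if any stop word occurs between a pair of double quotes. */
    static boolean hasQuotedStopWord(String query, Set<String> stopWords) {
        // Splitting on '"' puts quoted text at the odd indexes:
        //   a "b c" d  ->  ["a ", "b c", " d"]
        String[] parts = query.split("\"", -1);
        for (int i = 1; i < parts.length; i += 2) {
            if (i == parts.length - 1) {
                break; // an odd-indexed last part means the quote was never closed
            }
            for (String word : parts[i].trim().split("\\s+")) {
                if (stopWords.contains(word.toLowerCase(Locale.ROOT))) {
                    return true;
                }
            }
        }
        return false;
    }

    /** Applies the rule from the list: a quoted stop word selects the analyzer that keeps stop words. */
    static Choice choose(String query, Set<String> stopWords) {
        return hasQuotedStopWord(query, stopWords) ? Choice.KEEPS_STOP_WORDS : Choice.REMOVES_STOP_WORDS;
    }

    public static void main(String[] args) {
        // Illustrative stop word list, not Lucene's default set.
        Set<String> stop = new HashSet<>(Arrays.asList("no", "with"));
        System.out.println(choose("\"no dress code\"", stop));    // KEEPS_STOP_WORDS
        System.out.println(choose("shirts with trousers", stop)); // REMOVES_STOP_WORDS
    }
}
```

Because the choice is made once per query, a mixed query such as shirts with "no collar" selects the analyzer that keeps stop words, so "with" survives as well. That is the weakness described above.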


greetz

Christian




Re: Quotes dependent StopWords removal

Mark Miller-3
This keeps popping back into my head. A little more info for you. Bear
in mind I have not dealt with the QueryParser before.

Use the approach I gave last time. Pull out the QueryParser and change
either QueryParser.jj or QueryParser.java... you may be able to change
just QueryParser.java and avoid having to recompile the JavaCC grammar
file. At line 992 of QueryParser.java (a different line in
QueryParser.jj) you will see the line:

q = getFieldQuery(field, term.image.substring(1, term.image.length()-1), s);

This still uses the strategy I mentioned last time. The query parser
analyzes the text passed into the getFieldQuery function. This
particular call of getFieldQuery is made when the query parser sees a
quoted set of tokens (remember... I'm half guessing on all of this, I
don't know for sure). The analyzer used by getFieldQuery is stored in
the QueryParser member variable analyzer. So a possible solution is to
save the member variable analyzer to a local variable and replace it
with an analyzer that does not remove stop words right before the
getFieldQuery call. Restore the original analyzer to the member
variable after the call.

This may work for you.
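
To make the save / swap / restore pattern concrete, here is a minimal sketch in plain Java. It is a toy model, not the real QueryParser: AnalyzerSwapSketch and its members are hypothetical stand-ins, the "analyzer" is just a function from text to terms, and the result is a flat list of terms rather than a Query object.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

/** Toy model: swap the parser's analyzer member only around the call made for quoted text. */
final class AnalyzerSwapSketch {

    /** Stand-in for the parser's analyzer member variable. */
    private Function<String, List<String>> analyzer;

    /** Stand-in for an analyzer that does not remove stop words. */
    private final Function<String, List<String>> keepStopWords;

    AnalyzerSwapSketch(Set<String> stopWords) {
        this.analyzer = text -> terms(text, stopWords);
        this.keepStopWords = text -> terms(text, Collections.<String>emptySet());
    }

    /** Lower-cases, splits on whitespace, and drops anything in the given stop word set. */
    private static List<String> terms(String text, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String w : text.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!w.isEmpty() && !stopWords.contains(w)) {
                out.add(w);
            }
        }
        return out;
    }

    /** Stand-in for getFieldQuery: always analyzes with whatever the member currently holds. */
    private List<String> getFieldQuery(String text) {
        return analyzer.apply(text);
    }

    /** Parses a query; quoted segments are the odd-indexed parts after splitting on '"'. */
    List<String> parse(String query) {
        List<String> result = new ArrayList<>();
        String[] parts = query.split("\"", -1);
        for (int i = 0; i < parts.length; i++) {
            // An odd-indexed last part is an unclosed quote; treat it as unquoted text.
            boolean quoted = (i % 2 == 1) && (i < parts.length - 1);
            if (!quoted) {
                result.addAll(getFieldQuery(parts[i]));
                continue;
            }
            Function<String, List<String>> saved = analyzer; // save the member to a local
            analyzer = keepStopWords;                         // swap in the other analyzer
            try {
                result.addAll(getFieldQuery(parts[i]));
            } finally {
                analyzer = saved;                             // restore even if the call throws
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative stop word list, not Lucene's default set.
        Set<String> stop = Set.copyOf(List.of("no", "with"));
        System.out.println(new AnalyzerSwapSketch(stop).parse("shirts with \"no collar\"")); // [shirts, no, collar]
    }
}
```

Swapping a member variable in this way is not safe if one parser instance is shared between threads; the try/finally only guarantees that the original analyzer is put back.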

- Mark
