Phrase search using quotes -- special Tokenizer


Re: Phrase search using quotes -- special Tokenizer

Philip Brown
Like you, I would PREFER not to change any of the base Lucene code -- and I imagine there is still some way to do what I want (possibly by extending some other existing class) with what is already available.

Regarding point 0) -- You are right that if I add "test phrase" to the index as UN_TOKENIZED and then search on "test" or "phrase" individually, it will not find them (unless they have been added separately by themselves). This behavior is actually desirable in my case: I am adding single keywords (or phrases) to the document field (i.e., there is no sentence-type text), and I want the search to return only results that have the keyword (or phrase) that I added. Although at this point I'm mostly concerned about being able to return results when I search for an UN_TOKENIZED phrase that was added, I would also like to "normalize" a phrase (spaces), a hyphenated word, or an underscored word to the same value -- e.g. MS-WORD or ms_WORd or "MS Word" --> ms_word.
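
(A minimal sketch of the normalization I have in mind -- the class and method names are illustrative only, not any existing Lucene API:)

public class KeywordNormalizer {
    // Collapse runs of whitespace, hyphens and underscores to "_" and
    // lowercase, so "MS-WORD", "ms_WORd" and "MS Word" all become "ms_word".
    public static String normalize(String keyword) {
        return keyword.trim().toLowerCase().replaceAll("[\\s_-]+", "_");
    }

    public static void main(String[] args) {
        System.out.println(normalize("MS-WORD"));  // ms_word
        System.out.println(normalize("ms_WORd"));  // ms_word
        System.out.println(normalize("MS Word"));  // ms_word
    }
}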

Regarding point 1) -- If there were a way to add the keywords (including phrases with spaces) to the index, such that I could search for them using a query returned by QueryParser.parse(<query_string>), I think this would suit my needs.  I looked at (and ran) your Scratch example.  I almost think it would work for my purpose, EXCEPT that, let's say I added 3 documents...

doc.add(new Field("keyword", "hyphenated-word", Field.Store.YES, Field.Index.TOKENIZED));
doc.add(new Field("keyword", "underscored-word", Field.Store.YES, Field.Index.TOKENIZED));
doc.add(new Field("keyword", "phrase with spaces", Field.Store.YES, Field.Index.TOKENIZED));

If I add them as TOKENIZED, a search on "phrase" would return 1 hit, which is NOT really what I want.  I want a hit for "phrase with spaces", but not for "phrase" or "spaces" alone.


Perhaps you understand my situation a bit more now and could provide some additional insight.  Basically, whatever is added to the keyword field is what will be used in the search (plus some other field values).  

Thanks.


Erick Erickson wrote
Disclaimer: Of course I'm not as familiar with your problem space as you
are, so I may be way out in left field, but...

I *still* think you're making waaaaaay too much work for yourself and need
to examine your assumptions.

0> But when you index something UN_TOKENIZED as in your example, I don't
think you'll  find the words "phrase" and "test" if you just search for them
individually.

1> "doesn't parse apart phrases". Why do you care? The "usual" way of
handling this is to just go ahead and parse them apart, then submit your
query with quotes embedded. Would that serve your needs? If not, I bet you
could make a cool regex that would do this for you and use a
PatternAnalyzer. If neither of those work, instead of hacking Lucene code,
make your own tokenizer by overriding one of the Lucene tokenizers. See
Lucene in Action for example, the section on synonyms as I remember.....
It'll be waaaaay less work in the long run than having to try to stay in
synch by hacking the Lucene code. Especially for the next poor soul who has
to maintain it........

2> doesn't parse/separate underscores and hyphens. Why not use a
PatternAnalyzer? You can make it do most anything you want with a clever
regex.

3> the other thing that might be producing unexpected results is that the
default is OR for QueryParser.......

Since no amount of documentation replaces a program, I've included one
illustrating what I'm talking about. It's less likely we'll talk past each
other this way. All it may *really* demonstrate is that I don't understand
what you're trying to do at all, but at least we'll know <G>...

Notice in particular that it finds the hyphenated words, but doesn't find
their individual parts. Also, quoted phrases are found, even though the
stop words aren't in the index (examined via Luke). I used Eclipse to
compile/run it......

Best
Erick

package scratch;

import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.index.memory.PatternAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class Scratch {

    private Analyzer analyzer = null;

    private Analyzer getAnalyzer() {
        if (analyzer == null) {
            // Break on whitespace or anything that is NOT a hyphen,
            // underscore or letter.
            Pattern pat = Pattern.compile("(\\s|[^-_a-zA-Z])");
            analyzer = new PatternAnalyzer(pat, true,
                    StopFilter.makeStopSet(STOP_WORDS));
        }
        return analyzer;

    }

    private void makeTestIndex() throws Exception {
        IndexWriter writer = new IndexWriter("C:/mydir", getAnalyzer(),
true);
        String text = ("this is the test text hyphenated-Iamhyphenated
underscored_IaMunDerScoRed");
        Document doc = new Document();
        doc.add(new Field("test", text, Field.Store.YES,
                        Field.Index.TOKENIZED));
        writer.addDocument(doc);

        writer.close();
    }

    private void doSearch(String query, int expectedHits) throws Exception {
        try {
            QueryParser qp = new QueryParser("test", getAnalyzer());
            qp.enable_tracing();
            IndexSearcher srch = new IndexSearcher("C:/mydir");
            Query tmp = qp.parse(query);
            // Uncomment to see parsed form of query
            // System.out.println("Parsed form is '" + tmp.toString() +
"'");
            Hits hits = srch.search(tmp);

            String msg = "";

            if (hits.length() == expectedHits) {
                msg = "Test passed ";
            } else {
                msg = "************TEST FAILED************ ";
            }
            System.out.println(msg + "Expected "
                    + Integer.toString(expectedHits) + " hits, got "
                    + Integer.toString(hits.length()) + " hits");

        } catch (IOException e) {
            System.out.println("Caught IOException");
            e.printStackTrace();
        }
    }

    private void doSearchPhrase(String phrase, int expectedHits) throws Exception {
        doSearch("\"" + phrase + "\"", expectedHits);
    }
    public static void main(String[] args) {
        try {
            Scratch scratch = new Scratch();
            scratch.getAnalyzer();
            scratch.makeTestIndex();
            scratch.doSearch("underscored_iamunderscored", 1);
            scratch.doSearch("underscored iamunderscored", 0);
            scratch.doSearch("hyphenated-iamhyphenated", 1);
            scratch.doSearch("iamunderscored", 0);
            scratch.doSearch("underscored", 0);

            scratch.doSearchPhrase("this is the test text", 1);
            scratch.doSearchPhrase("text with hyphenated-iamhyphenated", 1);
            scratch.doSearchPhrase("text with underscored_iamunderscored",
0);


        } catch (Exception e) {
            System.err.println(e.getMessage());
        }
    }

    protected static final String[] STOP_WORDS = { "a", "an", "and", "are",
            "as", "at", "b", "be", "but", "by", "c", "d", "e", "f", "for",
            "if", "g", "h", "i", "in", "into", "is", "it", "j", "k", "l", "m",
            "n", "no", "not", "o", "of", "on", "or", "p", "q", "r", "s",
            "such", "t", "that", "the", "their", "then", "there", "these",
            "they", "this", "to", "u", "v", "w", "was", "will", "with", "x",
            "y", "z" };

}



On 9/2/06, Philip Brown <pmb@us.ibm.com> wrote:
>
>
> I tend to agree with Mark.  I tried a query as so...
>
>    TermQuery query = new TermQuery(new Term("keywordField", "phrase
> test"));
>    IndexSearcher searcher= new IndexSearcher(activeIdx);
>    Hits hits = searcher.search(query);
>
> And this produced the expected results.  When building the index, I did
> NOT
> enclose the keywords in quotes -- just added as UN_TOKENIZED.
>
> Philip
>
>
> Mark Miller-5 wrote:
> >
> > I think if he wants to use the queryparser to parse his search strings
> > that he has no choice but to modify it. It will eat any pair of quotes
> > going through it no matter what analyzer is used.
> >
> > - Mark
> >> Well, you're flying blind. Is the behavior rooted in the indexing or
> >> querying? Since you can't answer that, you're reduced to trying random
> >> things hoping that one of them works. A little like voodoo. I've wasted
> >> faaaaarrrrrr too much time trying to solve what I was *sure* was the
> >> problem
> >> only to find it was somewhere else (the last place I look, of course)
> >> <G>...
> >>
> >> Using Luke on a RAMDir. No, I don't know how to, but it should be a
> >> simple
> >> thing to write the index to an FSDir at the same time you create your
> >> RAMDir
> >> and use Luke then. This is debugging, after all.
> >>
> >> I'd be really, really, really reluctant to modify the query parser
> and/or
> >> the tokenizer, since whenever I've been tempted it's usually because I
> >> don't
> >> understand the tools already provided. Then I have to maintain my
> custom
> >> code. Which sucks. Although it sure feels more productive to hack a
> >> bunch of
> >> code and get something that works 90% of the time, then spend weeks
> >> making
> >> the other 10% work than taking two days to find the 3 lines you
> *really*
> >> need <G>.
> >>
> >> Have you thought of a PatternAnalyzer? It takes a regular expression
> >> as the
> >> tokenizer and  (from the Javadoc)
> >> <<< Efficient Lucene analyzer/tokenizer that preferably operates on a
> >> String rather than a Reader, that can flexibly separate text into terms
> >> via a regular expression Pattern (with behaviour identical to
> >> String.split(String)), and that combines the functionality of
> >> LetterTokenizer, LowerCaseTokenizer, WhitespaceTokenizer, StopFilter
> >> into a single efficient multi-purpose class. >>>
> >>
> >> One word of caution: the regular expression consists of expressions
> >> that *break* tokens, not expressions that *form* words, which threw me
> >> at first. Just like the doc says, like String.split <G>.... This is in
> >> 2.0, although I *believe* it's also in the contrib section of 1.9 (or
> >> is in the regular API, I forget).
> >>
> >> Best
> >> Erick
> >>
> >> On 9/1/06, Philip Brown <pmb@us.ibm.com> wrote:
> >>>
> >>>
> >>> No, I've never used Luke.  Is there an easy way to examine my
> >>> RAMDirectory index?  I can create the index with no quoted keywords,
> >>> and when I search for a keyword, I get back the expected results (just
> >>> can't search for a phrase that has whitespace in it).  If I create the
> >>> index with phrases in quotes, then when I search for anything in
> >>> double quotes, I get back nothing.  If I create the index with
> >>> everything in quotes, then when I search for anything by the keyword
> >>> field, I get nothing, regardless of whether I use quotes in the query
> >>> string or not.  (I can get results back by searching on other fields.)
> >>> What do you think?
> >>>
> >>> Philip
> >>>
> >>>
> >>> Erick Erickson wrote:
> >>> >
> >>> > OK, I've gotta ask. Have you examined your index with Luke to see
> >>> > if what you *think* is in the index actually *is*???
> >>> >
> >>> > Erick
> >>> >
> >>> > On 9/1/06, Philip Brown <pmb@us.ibm.com> wrote:
> >>> >>
> >>> >>
> >>> >> Interesting...just ran a test where I put double quotes around
> >>> >> everything (including single keywords) in the source text and then
> >>> >> ran searches for a known keyword with and without double quotes --
> >>> >> doesn't find it either time.
> >>> >>
> >>> >>
> >>> >> Mark Miller-5 wrote:
> >>> >> >
> >>> >> > Sorry to hear you're having trouble. You indeed need the double
> >>> >> > quotes in the source text. You will also need them in the query
> >>> >> > string. Make sure they are in both places. My machine is hosed
> >>> >> > right now or I would do it for you real quick. My guess is that I
> >>> >> > forgot to mention...not only do you need to add the <QUOTED>
> >>> >> > definition to the TOKEN section, but below that you will find the
> >>> >> > grammar...you need to add <QUOTED> to the grammar. If you look at
> >>> >> > how <NUM> and <APOSTROPHE> are done you will prob see what you
> >>> >> > should do. If not, my machine should be back up tomorrow...
> >>> >> >
> >>> >> > - Mark
> >>> >> >
> >>> >> > On 9/1/06, Philip Brown <pmb@us.ibm.com> wrote:
> >>> >> >>
> >>> >> >>
> >>> >> >> Well, I tried that, and it doesn't seem to work still.  I would
> >>> >> >> be happy to zip up the new files, so you can see what I'm using
> >>> >> >> -- maybe you can get it to work.  The first time, I tried
> >>> >> >> building the documents without quotes surrounding each phrase.
> >>> >> >> Then, I retried by enclosing every phrase within double quotes.
> >>> >> >> Neither seemed to work.  When constructing the query string for
> >>> >> >> the search, I always added the double quotes (otherwise, it'd
> >>> >> >> think it was multiple terms).  (I didn't even test the underscore
> >>> >> >> and hyphenated terms.)  I thought Lucene was (sort of by default)
> >>> >> >> set up to search quoted phrases.  From
> >>> >> >> http://lucene.apache.org/java/docs/api/index.html --> A Phrase is
> >>> >> >> a group of words surrounded by double quotes such as "hello
> >>> >> >> dolly".  So, this should be easy, right?  I must be missing
> >>> >> >> something stupid.
> >>> >> >>
> >>> >> >> Thanks,
> >>> >> >>
> >>> >> >> Philip
> >>> >> >>
> >>> >> >>
> >>> >> >> Mark Miller-5 wrote:
> >>> >> >> >
> >>> >> >> > So this will recognize anything in quotes as a single token,
> >>> >> >> > and '_' and '-' will not break up words. There may be some
> >>> >> >> > repercussions for the NUM token but nothing I'd worry about.
> >>> >> >> > Maybe you want to use Unicode for '-' and '_' as well...I
> >>> >> >> > wouldn't worry about it myself.
> >>> >> >> >
> >>> >> >> > - Mark
> >>> >> >> >
> >>> >> >> >
> >>> >> >> > TOKEN : {                      // token patterns
> >>> >> >> >
> >>> >> >> >   // basic word: a sequence of digits & letters
> >>> >> >> >   <ALPHANUM: (<LETTER>|<DIGIT>|<KOREAN>)+ >
> >>> >> >> >
> >>> >> >> > | <QUOTED:     "\"" (~["\""])+ "\"">
> >>> >> >> >
> >>> >> >> >   // internal apostrophes: O'Reilly, you're, O'Reilly's
> >>> >> >> >   // use a post-filter to remove possesives
> >>> >> >> > | <APOSTROPHE: <ALPHA> ("'" <ALPHA>)+ >
> >>> >> >> >
> >>> >> >> >   // acronyms: U.S.A., I.B.M., etc.
> >>> >> >> >   // use a post-filter to remove dots
> >>> >> >> > | <ACRONYM: <ALPHA> "." (<ALPHA> ".")+ >
> >>> >> >> >
> >>> >> >> >   // company names like AT&T and Excite@Home.
> >>> >> >> > | <COMPANY: <ALPHA> ("&"|"@") <ALPHA> >
> >>> >> >> >
> >>> >> >> >   // email addresses
> >>> >> >> > | <EMAIL: <ALPHANUM> (("."|"-"|"_") <ALPHANUM>)* "@"
> <ALPHANUM>
> >>> >> >> > (("."|"-") <ALPHANUM>)+ >
> >>> >> >> >
> >>> >> >> >   // hostname
> >>> >> >> > | <HOST: <ALPHANUM> ("." <ALPHANUM>)+ >
> >>> >> >> >
> >>> >> >> >   // floating point, serial, model numbers, ip addresses, etc.
> >>> >> >> >   // every other segment must have at least one digit
> >>> >> >> > | <NUM: (<ALPHANUM> <P> <HAS_DIGIT>
> >>> >> >> >        | <HAS_DIGIT> <P> <ALPHANUM>
> >>> >> >> >        | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
> >>> >> >> >        | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
> >>> >> >> >        | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P>
> >>> >> <HAS_DIGIT>)+
> >>> >> >> >        | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P>
> >>> >> <ALPHANUM>)+
> >>> >> >> >         )
> >>> >> >> >   >
> >>> >> >> > | <#P: ("_"|"-"|"/"|"."|",") >
> >>> >> >> > | <#HAS_DIGIT:                      // at least one digit
> >>> >> >> >     (<LETTER>|<DIGIT>)*
> >>> >> >> >     <DIGIT>
> >>> >> >> >     (<LETTER>|<DIGIT>)*
> >>> >> >> >   >
> >>> >> >> >
> >>> >> >> > | < #ALPHA: (<LETTER>)+>
> >>> >> >> > | < #LETTER:                      // unicode letters
> >>> >> >> >       [
> >>> >> >> >        "\u0041"-"\u005a",
> >>> >> >> >        "\u0061"-"\u007a",
> >>> >> >> >        "\u00c0"-"\u00d6",
> >>> >> >> >        "\u00d8"-"\u00f6",
> >>> >> >> >        "\u00f8"-"\u00ff",
> >>> >> >> >        "\u0100"-"\u1fff",
> >>> >> >> >        "-", "_"
> >>> >> >> >       ]
> >>> >> >> >   >

Re: Phrase search using quotes -- special Tokenizer

Chris Hostetter-3

I haven't really been following this thread, but it's gotten so long
i got interested.

from what i can tell skimming the discussion so far, it seems like the
biggest confusion is about the definition of a "phrase", what analyzers
do with "quote" characters, and what the QueryParser does with "quote"
characters -- when ultimately you don't seem to really care about "phrases"
in a textual searching sense; nor do you seem to care about any of the
"features" of the QueryParser.

it seems that what you care about is:

 1) making documents, and adding a list of "text chunks" to those
    documents (what you've been calling phrases)
 2) you then want to be able to search for "almost-exact" matches on those
    "text chunks" ... these matches should be "exactish" because you don't
    want partial matches based on white space, or splitting on hyphens,
    but they shouldn't be truly exact because you want some simple
    normalization...

: actually would like to "normalize" a phrase (spaces) or a hyphenated word or
: an underscored word to the same value -- e.g. MS-WORD or ms_WORd or "MS
: Word" --> ms_word.

...in which case, you should:
 a) write yourself an analyzer which does no "tokenizing" (ie: each input
    Field value generates a single token) but does the normalization you
    want.
 b) use this Analyzer when you add the fields to your documents; even
    though you don't want *real* tokenization, make the field type
    TOKENIZED so your analyzer gets used.
 c) when you get some text input to search on, pass it to the same
    Analyzer, take the Token you get back, and manually construct a
    TermQuery out of it for the necessary field.

...that's it.  that's all she wrote -- don't even look in QueryParser's
general direction, at all.



-Hoss


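A minimal sketch of the analyzer described in (a), assuming Lucene 1.9/2.0-era APIs (the class names are illustrative): KeywordTokenizer emits each field value as a single token, and a filter normalizes it (lowercase, with runs of whitespace, hyphens and underscores collapsed to "_"), so MS-WORD, ms_WORd and "MS Word" all index as ms_word.

import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.KeywordTokenizer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class NormalizingKeywordAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // KeywordTokenizer emits the entire field value as one token;
        // the filter then normalizes that token.
        return new NormalizeFilter(new KeywordTokenizer(reader));
    }

    private static class NormalizeFilter extends TokenFilter {
        NormalizeFilter(TokenStream in) {
            super(in);
        }

        public Token next() throws IOException {
            Token t = input.next();
            if (t == null) {
                return null;
            }
            // Lowercase and collapse whitespace/hyphen/underscore runs to "_".
            String norm = t.termText().trim().toLowerCase()
                    .replaceAll("[\\s_-]+", "_");
            return new Token(norm, t.startOffset(), t.endOffset());
        }
    }
}

Index the keyword field with this analyzer as TOKENIZED (step b); at query time, run the input through the same analyzer and build a TermQuery from the single token you get back (step c -- exactly the construction sketched in the follow-up below).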


Re: Phrase search using quotes -- special Tokenizer

Erick Erickson
Yeah, what he said <G>....


Re: Phrase search using quotes -- special Tokenizer

Philip Brown
In reply to this post by Chris Hostetter-3
Thanks for your input.  I'm sure I could do as you suggest (and maybe that will end up being my best option), but I had hoped to use a string for creating the query object, particularly as some of my queries are a bit complex.

Thanks.


Re: Phrase search using quotes -- special Tokenizer

Chris Hostetter-3

: Thanks for your input.  I'm sure I could do as you suggest (and maybe that
: will end up being my best option), but I had hoped to use a string for
: creating the query object, particularly as some of my queries are a bit
: complex.

you have to clarify what you mean by "use a string for creating the query
object" ... there's nothing in what i suggested that implies you can't do
that; that's exactly what i'm suggesting you do...

   String input = ...;
   Analyzer a = new YourCustomAnalyzer();
   // because you know your analyzer always produces exactly one token...
   Token t = a.tokenStream("yourField", new StringReader(input)).next();
   Query yourQuery = new TermQuery(new Term("yourField", t.termText()));

...if your queries are more complex than just the "exactish" matching you
described before, then that's a separate issue -- what you described
didn't sound like it required any special input processing -- you said you
had a "string" and you wanted to find exact matches on that string (with
some normalization) ... but that you didn't want your input split on
whitespace, or hyphens, or any of the "special" characters QueryParser
uses.

If you want other things then that certainly makes things more
complicated, but the basic idea is still the same ... so what exactly do
you mean when you say it's more complicated?





-Hoss




Re: Phrase search using quotes -- special Tokenizer

Mark Miller-3
Keeping in mind that Hoss's input is much more valuable than mine...

It sounds like you want what I originally gave you. You want to be able
to perform complex queries with the QueryParser, you want '-' and '_'
to not break words, and you want quoted words to be tokenized as one
token with no extra processing. Erick's concerns are obviously
valid...but you are not hacking the Lucene code for the new Standard
Analyzer I hope...pull it out and make your own...that should just be
a custom analyzer that acts a lot like the standard analyzer. As far as
the query parser...if you want to be able to mix normal searching with
your quoted requirements you are going to have to make your own
queryparser...no fun in that...so why not pull the queryparser
out...make the single line change...and later, if the queryparser is
updated...take the new one and make the single line change. Not that big
of a nightmare.

Hopefully Hoss can give you something better...but from what I
understand you want the queryparser language and you want your quotes
deal...and they do not go together without the change I gave you. If
there is another way to do it, it's hard to believe it will be easier
than maintaining a single line change in QueryParser.

Keep in mind I am a Lucene beginner. Both Hoss and Erick are more
knowledgeable than I am about Lucene. Just putting in my two cents.

- Mark



Re: Phrase search using quotes -- special Tokenizer

Philip Brown
In reply to this post by Chris Hostetter-3
Yeah, they are more complex than the "exactish" match -- basically, there are more fields involved -- combined sometimes with AND and sometimes with OR, and sometimes negated field values, sometimes groupings, etc.  These other field values are all single words (no spaces), and a search might involve a wildcard on them.  Hope that helps.

Thanks.


Re: Phrase search using quotes -- special Tokenizer

Mark Miller-3
More to consider:
Perhaps there is some way to get what you want by overriding
getFieldQuery(String, String) instead. I have not been able to come up
with anything simple off the top of my head, but overriding
getFieldQuery would free you from having to make that line change on
every Lucene update. Perhaps you could scan the string in getFieldQuery
and, if you find a space, skip the analyzing and return a term query,
putting the quotes back on -- "good old boy" might come in and you pump
out "good old boy" as a term query. If there is no space in the query
part then analyze like normal.

- Mark
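
A sketch of that idea, assuming the Lucene 1.9/2.0-era QueryParser (the subclass name is hypothetical): any chunk that reaches getFieldQuery still containing a space must have been quoted, since unquoted input is split on whitespace first, so it bypasses analysis and becomes a literal TermQuery; everything else falls through to the normal handling.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class KeywordPhraseQueryParser extends QueryParser {

    public KeywordPhraseQueryParser(String field, Analyzer analyzer) {
        super(field, analyzer);
    }

    protected Query getFieldQuery(String field, String queryText)
            throws ParseException {
        // QueryParser strips the quotes before calling getFieldQuery, so a
        // space here means the user quoted a multi-word keyword: return it
        // as one literal term (lowercased to match a lowercasing index-time
        // analyzer) instead of letting it become a PhraseQuery.
        if (queryText.indexOf(' ') >= 0) {
            return new TermQuery(new Term(field, queryText.toLowerCase()));
        }
        return super.getFieldQuery(field, queryText);
    }
}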


Re: Phrase search using quotes -- special Tokenizer

Chris Hostetter-3
In reply to this post by Philip Brown

: Yeah, they are more complex than the "exactish" match -- basically, there are
: more fields involved -- combined sometimes with AND and sometimes with OR,
: and sometimes negated field values, sometimes groupings, etc.  These other
: field values are all single words (no spaces), and a search might involve a
: wildcard on them.  Hope that helps.

I'm not seeing any problems with using QueryParser -- what you still need
however is an Analyzer for the fields you want the special treatment on.
if you write that analyzer, combine it with the StandardAnalyzer into a
PerFieldAnalyzerWrapper and use that in your IndexWriter and QueryParser, you
should be good to go.

if you do that, and it still doesn't work the way you expect, write a
small self contained JUnit test that indexes a few sample docs into a
RAMDirectory index and queries against it showing what you expect to happen
(that isn't working) and send that to the list.

People will be able to give you much better advice once they see some
executable code that illustrates the problems you are having.




-Hoss
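
For the wiring described above, a sketch (Lucene 2.0-era APIs; NormalizingKeywordAnalyzer is the illustrative single-token analyzer sketched earlier in the thread, and the index path is made up):

import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;

public class WiringSketch {
    public static void main(String[] args) throws Exception {
        // Every field defaults to StandardAnalyzer; the "keyword" field
        // gets the single-token normalizing analyzer instead.
        PerFieldAnalyzerWrapper wrapper =
                new PerFieldAnalyzerWrapper(new StandardAnalyzer());
        wrapper.addAnalyzer("keyword", new NormalizingKeywordAnalyzer());

        // Use the SAME wrapper on both sides so index-time and query-time
        // analysis agree.
        IndexWriter writer = new IndexWriter("C:/myindex", wrapper, true);
        // ... add documents with writer ...
        writer.close();

        QueryParser qp = new QueryParser("keyword", wrapper);
        // ... qp.parse(...) and search ...
    }
}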




Re: Phrase search using quotes -- special Tokenizer

Philip Brown
So, if I do as you suggest below (using PerFieldAnalyzerWrapper with StandardAnalyzer), then I still need to enclose the phrases (keywords with spaces) in quotes when I issue the search, and they are only returned in the results if the case is identical to how they were added?  (This seems to be what I observe anyway.  And whether I add as TOKENIZED or UN_TOKENIZED seems to have no effect.)

Thanks.


Re: Phrase search using quotes -- special Tokenizer

Chris Hostetter-3

: So, if I do as you suggest below (using PerFieldAnalyzerWrapper with
: StandardAnalyzer) then I still need to enclose in quotes the phrases
: (keywords with spaces) when I issue the search, and they are only returned

Yes, quotes will be necessary to tell the QueryParser "this
is one chunk of text, pass it to the analyzer whole" -- but that's so you
can get the "complex" part of the problem you described... recognizing
that "my brown-cow" and "red fox" should be matched as separate values
instead of trying to find one big value containing "my brown-cow red fox"

: in the results if the case is identical to how it was added?  (This seems to
: be what I observe anyway.  And whether I add as TOKENIZED or UN_TOKENIZED
: seems to have no effect.)

1) whether case matters is determined entirely by your analyzer; if it
   produces different tokens for "Blue" and "BLUE" then case matters
2) use TOKENIZED or your Analyzer will be completely irrelevant
3) if you observe something working differently than you expect, post the
   code -- we're way past the point of being able to offer you any
   meaningful help without seeing a self contained example of what you want
   to see work.



-Hoss




Re: Phrase search using quotes -- special Tokenizer

Mark Miller-3
Some info to help you on your journey :)

1. If you add a field as untokenized then it will not be analyzed when added
to the index. However, QueryParser will not know that this happened and will
tokenize queries on that field.

2. The solution that Hoss has explained to you is to leave the default quote
handling in place. The default quote handling is this:

On indexing: the analyzers ditch all quotes. As far as the index is concerned
they are of no value...position increments are used instead.

Searching with QueryParser: when the QueryParser detects something in
quotes, it takes what's between the quotes and passes that to
getFieldQuery(). getFieldQuery then analyzes the quoted chunk sans the
quotes. Stop words are removed, stemming is performed, etc., depending on your
analyzer. getFieldQuery sees that multiple tokens came out of the analyzer
and that the positions between tokens indicate that you are going for a
phrase search. A phrase search is generated. A phrase search with stopwords
removed has interesting sloppy matching. A phrase search can also match out
of order given enough slop. This is normally fine behavior for most
applications I can think of. You need to consider whether it is fine behavior
for you. You first mentioned that you only want exact matches to be made on
quoted searches...that you want no stop words removed, etc. If there is some
reason you really need this (I don't see it myself) then use the method I
gave you. I would think you should be fine with the normal behavior, but
then I don't know why you asked about this to begin with.

3. If you are mixing quoted data with non-quoted data, a per-field analyzer
won't be of much help. The quoted and unquoted data will be in the same
field I assume. Are you separating the quoted stuff from the non-quoted and
putting them in separate fields?


- Mark
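
A quick way to see the default behavior described in point 2, as a sketch assuming Lucene 2.0-era APIs:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class QuoteHandlingDemo {
    public static void main(String[] args) throws Exception {
        QueryParser qp = new QueryParser("keyword", new StandardAnalyzer());
        // The quoted chunk is analyzed and comes back as a multi-term
        // (phrase) query, e.g. keyword:"good old boy" -- not a single term.
        Query q = qp.parse("\"good old boy\"");
        System.out.println(q.getClass().getName() + " -> " + q);
    }
}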

Re: Phrase search using quotes -- special Tokenizer

Philip Brown
In reply to this post by Chris Hostetter-3
Here's a little sample program (borrowed some code from Erick Erickson :)).  Whether I add as TOKENIZED or UN_TOKENIZED seems to make no difference in the output.  Is this what you'd expect?

- Philip

package com.test;

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.RAMDirectory;

public class Test2 {
    private PerFieldAnalyzerWrapper analyzer = null;
    private RAMDirectory idx = null;

    private Analyzer getAnalyzer() {
        if (analyzer == null) {
            analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer());
            analyzer.addAnalyzer("keyword", new KeywordAnalyzer());
        }
        return analyzer;
    }

    private void makeTestIndex() throws Exception {
        idx = new RAMDirectory();
        IndexWriter writer = new IndexWriter(idx, getAnalyzer(), true);
        Document doc = new Document();
        doc.add(new Field("keyword", "hello world", Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("booleanField", "false", Field.Store.YES, Field.Index.UN_TOKENIZED));
        writer.addDocument(doc);
        doc = new Document();
        doc.add(new Field("keyword", "hello world", Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("booleanField", "true", Field.Store.YES, Field.Index.UN_TOKENIZED));
        writer.addDocument(doc);
        System.out.println(writer.docCount());
        writer.optimize();
        writer.close();
    }

    private void doSearch(String query, int expectedHits) throws Exception {
        try {
            QueryParser qp = new QueryParser("keyword", getAnalyzer());
            IndexSearcher srch = new IndexSearcher(idx);
            Query tmp = qp.parse(query);
            // Show parsed form of query
            System.out.println("Parsed form is '" + tmp.toString() + "'");
            Hits hits = srch.search(tmp);

            String msg = "";

            if (hits.length() == expectedHits) {
                msg = "Test passed ";
            } else {
                msg = "************TEST FAILED************ ";
            }
            System.out.println(msg + "Expected "
                    + Integer.toString(expectedHits) + " hits, got "
                    + Integer.toString(hits.length()) + " hits");

        } catch (IOException e) {
            System.out.println("Caught IOException");
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        try {
            Test2 test = new Test2();
            test.makeTestIndex();
            test.doSearch("Hello World", 0);
            test.doSearch("hello world", 0);
            test.doSearch("hello", 0);
            test.doSearch("world", 0);

            test.doSearch("\"Hello World\"", 0);
            test.doSearch("\"hello world\"", 2);
            test.doSearch("\"hello world\" +booleanField:false", 1);
            test.doSearch("\"hello world\" +booleanField:true", 1);

        } catch (Exception e) {
            System.err.println(e.getMessage());
        }
    }
}

Chris Hostetter wrote
: So, if I do as you suggest below (using PerFieldAnalyzerWrapper with
: StandardAnalyzer) then I still need to enclose in quotes the phrases
: (keywords with spaces) when I issue the search, and they are only returned

Yes, quotes will be necessary to tell the QueryParser "this
is one chunk of text, pass it to the analyzer whole" - but that's so you
can get the "complex" part of the problem you described... recognizing
that "my brown-cow" and "red fox" should be matched as separate values
instead of trying to find one big value containing "my brown-cow red fox"
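For instance, a small sketch of that behavior (reusing the "keyword" field and KeywordAnalyzer from the sample program above; the class name is just a placeholder):

import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class QuotedChunkSketch {
    public static void main(String[] args) throws Exception {
        QueryParser qp = new QueryParser("keyword", new KeywordAnalyzer());
        Query q = qp.parse("\"my brown-cow\" \"red fox\"");
        // Each quoted chunk is handed to the analyzer whole, so the parsed
        // form comes out as two single-term queries, roughly:
        //   keyword:my brown-cow keyword:red fox
        System.out.println(q.toString());
    }
}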

: in the results if the case is identical to how it was added?  (This seems to
: be what I observe anyway.  And whether I add as TOKENIZED or UN_TOKENIZED
: seems to have no effect.)

1) whether case matters is determined entirely by your analyzer; if it
   produces different tokens for "Blue" and "BLUE" then case matters
2) use TOKENIZED or your Analyzer will be completely irrelevant
3) if you observe something working differently than you expect, post the
  code -- we're way past the point of being able to offer you any
  meaningful help without seeing a self-contained example of what you want
  to see work.



-Hoss



Re: Phrase search using quotes -- special Tokenizer

Chris Hostetter-3

1) consider using JUnit tests ... it makes it a lot easier for other people
to understand your expectations, and if it winds up demonstrating a genuine
bug in Lucene, it's easy to add to the test tree.

2) as I said before, your fields must be TOKENIZED, or your analyzer is
irrelevant at index time.

3) when I run the code you sent as is, I get lots of "Test passed" lines
and no "TEST FAILED" lines ... which makes sense, since you have everything
UN_TOKENIZED, so the literal values are getting indexed, which just so
happens to be what KeywordAnalyzer does as well -- hence if you change
everything from UN_TOKENIZED to TOKENIZED it will still work.


do you have an example of something that *isn't* working the way you want?
... if not, I don't see what your problem is; all your tests are passing :)
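Picking up point 1, here is a minimal JUnit (3.x-style TestCase) sketch of the same expectations; the single KeywordAnalyzer setup and the one-document index are simplifications of Philip's program, not a drop-in replacement for it:

import junit.framework.TestCase;

import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.RAMDirectory;

public class PhraseSearchTest extends TestCase {
    private RAMDirectory idx;

    protected void setUp() throws Exception {
        // one document whose "keyword" field is a single multi-word token
        idx = new RAMDirectory();
        IndexWriter writer = new IndexWriter(idx, new KeywordAnalyzer(), true);
        Document doc = new Document();
        doc.add(new Field("keyword", "hello world", Field.Store.YES, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.close();
    }

    private int hitCount(String query) throws Exception {
        QueryParser qp = new QueryParser("keyword", new KeywordAnalyzer());
        IndexSearcher srch = new IndexSearcher(idx);
        Hits hits = srch.search(qp.parse(query));
        int count = hits.length();
        srch.close();
        return count;
    }

    public void testQuotedPhraseMatchesWholeValue() throws Exception {
        assertEquals(1, hitCount("\"hello world\""));  // whole value matches
        assertEquals(0, hitCount("hello"));            // a lone word must not
    }
}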


: Date: Tue, 5 Sep 2006 14:06:13 -0700 (PDT)
: From: Philip Brown <[hidden email]>
: Reply-To: [hidden email]
: To: [hidden email]
: Subject: Re: Phrase search using quotes -- special Tokenizer
:
: [rest of the quoted message trimmed -- Philip's sample program and the
: Chris Hostetter exchange it quotes appear verbatim earlier in this thread]



-Hoss




Re: Phrase search using quotes -- special Tokenizer

Philip Brown
Sorry for the confusion and thanks for taking the time to educate me.  So, if I am just indexing literal values, what is the best way to do that (what analyzer)?  Sounds like this approach, even though it works, is not the preferred method.

          analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer());
          analyzer.addAnalyzer("keyword", new KeywordAnalyzer());

Thanks again.


Chris Hostetter wrote
[quoted message trimmed -- identical to Chris Hostetter's reply directly above]



Re: Phrase search using quotes -- special Tokenizer

Chris Hostetter-3

: Sorry for the confusion and thanks for taking the time to educate me.  So, if
: I am just indexing literal values, what is the best way to do that (what
: analyzer)?  Sounds like this approach, even though it works, is not the
: preferred method.

if you truly want just the literal values then KeywordAnalyzer will work
great -- but you mentioned before that you want something more complicated
(case normalization, I believe?) ... for something like that (lowercasing,
but preserving whitespace and punctuation) you'll need to write a custom
Analyzer ... that's not hard though, just glue together the
KeywordTokenizer with the LowerCaseFilter along the lines of...

  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new LowerCaseFilter(new KeywordTokenizer(reader));
  }

...if there are other special rules you want, then put them in other
filters and compose your Analyzer further.
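
Spelled out as a complete class, a sketch of that analyzer (the class name is just a placeholder):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.KeywordTokenizer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;

public class LowercaseKeywordAnalyzer extends Analyzer {
    // The whole field value becomes a single token, which is then
    // lowercased; whitespace and punctuation are left untouched.
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new LowerCaseFilter(new KeywordTokenizer(reader));
    }
}

With this, "Hello World" and "hello world" index to the same term; folding hyphens and underscores to a common character, as mentioned earlier in the thread, would take one more custom filter in the same chain.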



-Hoss


