default AND operator

dr fence
Why does my query "french AND antiques" work the way I expect using this
code:

  stemParser = new QueryParser("contents", stemmingAnalyzer);
  Query query = stemParser.parse(searchTerms);
  Hits docHits = searcher.search(query);

Debug from query shows: contents:french contents:antiqu  ... I would have
expected to see '+' before contents.

But not if I try the query again with "french antiques" with this code ...
which sets the default operator to AND:

  stemParser = new QueryParser("contents", stemmingAnalyzer);
  stemParser.setDefaultOperator(QueryParser.Operator.AND);
  Query query = stemParser.parse(searchTerms);
  Hits docHits = searcher.search(query);

Debug from Query shows this:  +contents:french +contents:antiqu

Re: default AND operator

Chris Hostetter-3

: Why does my query "french AND antiques" work the way I expect using this
: code:

can you be more specific about what it is you "expect", and what exactly
searchTerms is in your examples?  (presumably it's a string, is it the
string "french AND antiques" ... are you sure it's not "french and
antiques" ? ... QueryParser only respects AND and OR if they are
capitalized, otherwise they are treated as normal words, which are
probably StopWords to your analyzer .. in which case everything you've
shown makes perfect sense to me.)
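
The capitalization point above can be illustrated with a small toy model. This is only a sketch: ToyBooleanParser, its stop-word list, and its parse method are hypothetical names invented for this illustration, not Lucene's QueryParser, and it does no stemming. It assumes that only upper-case AND/OR count as operators, that everything else is lower-cased and checked against a stop-word list, and that the default operator decides whether unmarked terms are required.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical toy model, NOT Lucene's QueryParser.
final class ToyBooleanParser {
    // Assumed stop-word list for this illustration only.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("and", "or", "the", "a"));

    // Renders the parsed query in the "+field:term" debug style seen in the thread.
    static String parse(String field, String input, boolean defaultAnd) {
        List<String> terms = new ArrayList<>();
        List<Boolean> required = new ArrayList<>();
        String conj = null; // pending operator, if any
        for (String tok : input.trim().split("\\s+")) {
            if (tok.isEmpty()) {
                continue;
            }
            // Only the exact upper-case spellings are operators.
            if (tok.equals("AND") || tok.equals("OR")) {
                conj = tok;
                continue;
            }
            String term = tok.toLowerCase(Locale.ROOT);
            if (STOP_WORDS.contains(term)) {
                continue; // a lower-case "and" is an ordinary word and is dropped here
            }
            if (!terms.isEmpty()) {
                int last = required.size() - 1;
                if ("AND".equals(conj)) {
                    required.set(last, true);   // AND makes the previous term required
                } else if ("OR".equals(conj) && defaultAnd) {
                    required.set(last, false);  // explicit OR relaxes the previous term
                }
            }
            boolean req = defaultAnd ? !"OR".equals(conj) : "AND".equals(conj);
            terms.add(term);
            required.add(req);
            conj = null;
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < terms.size(); i++) {
            if (i > 0) {
                out.append(' ');
            }
            if (required.get(i)) {
                out.append('+');
            }
            out.append(field).append(':').append(terms.get(i));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Default operator OR, explicit upper-case AND: both terms required.
        System.out.println(parse("contents", "french AND antiques", false));
        // Default operator OR, lower-case "and": treated as a stop word, terms optional.
        System.out.println(parse("contents", "french and antiques", false));
        // Default operator AND, no explicit operator: both terms required.
        System.out.println(parse("contents", "french antiques", true));
    }
}
```

Under these assumptions, an unmarked query with the default operator set to AND and an explicit upper-case AND with the default operator left at OR render identically, while a lower-case "and" yields two optional terms.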



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: default AND operator

dr fence
When I use "french AND antiques" I get documents like this :

score: 1.0, boost: 1.0, cont: French Antiques
score: 0.23080501, boost: 1.0, cont: FRENCH SEPTIC
score: 0.23080501, boost: 1.0, cont: French & French Septic
score: 0.20400475, boost: 1.0,id: 25460, cont: French & Associates

As in the first e-mail the Query object shows these terms:

contents:french contents:antiqu  <---- using string "french AND antiques"

when using Operator.AND it shows these:

+contents:french +contents:antiqu      <----- here I used "french antiques"

The second query matches NONE of the documents above; in fact it only
matches anything if I do synonym expansion with stemming.

*****My big question here is: why doesn't Operator.AND make both of
these queries identical? These will be user-typed queries, so I want
Lucene to force the use of AND so that I don't have to search/replace.



Re: default AND operator

Erick Erickson
Are you really, really sure that your *analyzer* isn't automatically
lower-casing your *query* and turning "french AND antiques" into "french and
antiques", then, as Chris says, treating "and" as a stop word?

The fact that your parser transforms "antiques" into "antiqu" leads me to
suspect that there's a lot more going on in the parser analyzer than you
might expect....

And, in case you haven't already found it, are you sure what your index
contains? I've found Luke (google "luke lucene") to be very valuable for these
kinds of questions, particularly your issue about stemming etc.

Best
Erick


Re: default AND operator

Mark Miller-3
3 docs with one field each in index:
-------------------------------------
french beast stone
crazy rolling stone
rolling stone done in by coconut

3 searches, default op set as AND
-------------------------------------
search("coconut stone");
search("coconut OR stone");
search("coconut AND stone");

3 results:
--------------------------------------
query: +allFields:coconut +allFields:stone
Found 1 document(s) (in 31 milliseconds) that matched query 'coconut stone':

query: allFields:coconut allFields:stone
Found 3 document(s) (in 0 milliseconds) that matched query 'coconut OR
stone':

query: +allFields:coconut +allFields:stone
Found 1 document(s) (in 16 milliseconds) that matched query 'coconut AND
stone':


You do not find this to be true? Your analyzer should not be a problem,
as the QueryParser only passes text that is not QueryParser syntax (such
as AND and OR) to the analyzer.

Code follows:

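import java.io.IOException;
import java.util.Date;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.queryParser.QueryParser.Operator;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.RAMDirectory;

public class Tester {
    private static RAMDirectory directory;

    private static Analyzer analyzer;

    public static void main(String[] args) {
        setupIndex();

        try {
            // Same two terms: implicit operator, explicit OR, explicit AND.
            search("coconut stone");
            search("coconut OR stone");
            search("coconut AND stone");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Builds a three-document in-memory index with a single field each.
    private static void setupIndex() {
        directory = new RAMDirectory();

        // WhitespaceAnalyzer: no lower-casing, stop words or stemming.
        analyzer = new WhitespaceAnalyzer();

        IndexWriter writer;

        try {
            writer = new IndexWriter(directory, analyzer, true);

            Document doc = new Document();
            doc.add(new Field("allFields",
                    "french beast stone",
                    Field.Store.NO, Field.Index.TOKENIZED));
            writer.addDocument(doc);

            doc = new Document();
            doc.add(new Field("allFields", "crazy rolling stone",
                    Field.Store.NO, Field.Index.TOKENIZED));
            writer.addDocument(doc);

            doc = new Document();
            doc.add(new Field("allFields", "rolling stone done in by coconut",
                    Field.Store.NO, Field.Index.TOKENIZED));
            writer.addDocument(doc);

            writer.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Parses q with the default operator set to AND and prints the parsed
    // query and the number of hits.
    public static int search(String q) throws Exception {
        IndexSearcher is = new IndexSearcher(directory);

        QueryParser qp = new QueryParser("allFields", analyzer);

        qp.setDefaultOperator(Operator.AND);

        Query query = qp.parse(q);

        long start = new Date().getTime();
        Hits hits = is.search(query);
        long end = new Date().getTime();
        System.err.println("\nquery: " + query.toString());
        System.err.println("Found " + hits.length() + " document(s) (in " +
            (end - start) + " milliseconds) that matched query '" + q + "':");

        int count = hits.length();
        is.close();
        return count;
    }
}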


Re: default AND operator

dr fence
In reply to this post by Erick Erickson
I am new to Lucene so I'll admit I am confused by a few things.  I'm using
an index which was built with the StandardAnalyzer.  I have verified this by
using an IndexReader to read the docs back out ... Antiques is not Antiq in
the index.   So according to this note in the Lucene docs I would assume a
Query parsed without a stemming analyzer would have matched:

"Note: The analyzer used to create the index will be used on the terms and
phrases in the query string. So it is important to choose an analyzer that
will not interfere with the terms used in the query string."

But it's quite the opposite, only a query parsed with the stemming analyzer
is matching my queries.  So these are a few confusing issues which to me
seem a *bit* beside the point ... perhaps I'm wrong.

HOWEVER .. I'm still confused as to why the AND operator isn't matching my
"french AND antiques" query regardless of the index.

I will look into Luke ... thanks for your replies ... Mark


Re: default AND operator

Erick Erickson
Well, I'm puzzled as well, in my simple examples I just ran, the AND
operator behaves just fine, but that was using StandardAnalyzer. So it's
almost certain we're not talking about the same thing <G>...

So, I guess I have a couple of suggestions:

1> Try your query without the stemmingAnalyzer. Try StandardAnalyzer (or
even WhitespaceAnalyzer) and kind of build up to the stemmer. That'll at
least narrow the problem space.

2> You might post more details about the stemmingAnalyzer you're using. It's
possible that there's some innocuous-seeming line in the creation of the
stemmingAnalyzer you're feeding into the query parser that's producing this
behavior. Parenthetically, I'm not entirely sure you're not going to get
into a heap o' trouble using a StandardAnalyzer to create the index then
using a stemmingAnalyzer to query it. But, as you say, that's secondary to
the default AND question. I should also add that I don't know enough about
stemming analyzers to put in a thimble, so this is just a theoretical
concern.

3> Create a small, self-contained program that demonstrates this issue and
post it here. Or, even better, a junit test <G>.

I think we've exhausted the generic issues you might be having and could get
a much faster resolution with a complete example to look at. "The guys" have
been generous with many posters in looking at actual code......

Best
Erick.

P.S. Please post whatever the resolution is, I'm pretty curious what you
find.


Re: default AND operator

dr fence
Ok guys ... you're going to want to take a big stick to me.  The problem
was my HitCollector: I wasn't actually passing it to my searcher.  Yes,
somewhere in my testing I had commented out that code, and it was making it
look like I wasn't getting hits.

One more question about IndexWriters (maybe I don't deserve an answer here
:-) )  .... I assume that the Analyzer used is applied and written to the
index per field.  So if I wanted one for Snowball or Stemming I'd have to
write multiple indexes?  I'm a bit confused as to how the Stemmed queries
are being matched against my StandardAnalyzer index.

Thanks for the help!
Mark


Re: default AND operator

Chris Hostetter-3

: One more question about IndexWriters (maybe I don't deserve an answer here
: :-) )  .... I assume that the Analyzer used is applied and written to the
: index per field.  So if I wanted one for Snowball or Stemming I'd have to
: write multiple indexes?  I'm a bit confused as to how the Stemmed queries
: are being matched against my StandardAnalyzer index.

what do you mean "written to the index per field" .. analyzers aren't
written to the index at all, the analyzer used is completely forgotten
once your index is built.  if you want separate analyzers per field, take
a look at the PerFieldAnalyzerWrapper (i think that's the name) ... as for
why Stemmed Queries might match on terms indexed using StandardAnalyzer
... who knows ... it depends on how exactly they are getting stemmed, and
what other types of data might have made it into your index (maybe your
source data had the words you are searching on spelled incorrectly as
well, and it just happens to match the stemmed versions).

When you have questions like this, searcher.explain is your friend.
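
Since the index does not remember which analyzer built it, the caller has to supply matching analysis again at query time. The idea of choosing an analyzer per field can be sketched with a toy model. This is a minimal sketch, assuming analyzers are plain functions from text to terms: ToyPerFieldAnalyzer and its two sample analyzers are hypothetical names for this illustration and are not Lucene's PerFieldAnalyzerWrapper.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical toy model of "one analyzer per field, plus a default".
final class ToyPerFieldAnalyzer {
    private final Function<String, List<String>> defaultAnalyzer;
    private final Map<String, Function<String, List<String>>> perField = new HashMap<>();

    ToyPerFieldAnalyzer(Function<String, List<String>> defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    // Registers a field-specific analyzer; other fields keep the default.
    void addAnalyzer(String field, Function<String, List<String>> analyzer) {
        perField.put(field, analyzer);
    }

    // The same object must be used at index time and at query time,
    // because nothing about the choice is recorded with the terms.
    List<String> analyze(String field, String text) {
        return perField.getOrDefault(field, defaultAnalyzer).apply(text);
    }

    // Sample analyzer: split on whitespace only, keep case and punctuation.
    static List<String> whitespace(String text) {
        return nonEmpty(text.split("\\s+"));
    }

    // Sample analyzer: lower-case, then split on anything that is not a letter.
    static List<String> lowercaseLetters(String text) {
        return nonEmpty(text.toLowerCase(Locale.ROOT).split("[^a-z]+"));
    }

    private static List<String> nonEmpty(String[] parts) {
        List<String> terms = new ArrayList<>();
        for (String part : parts) {
            if (!part.isEmpty()) {
                terms.add(part);
            }
        }
        return terms;
    }
}
```

If the query side analyzes a field differently from the indexing side, the produced terms can differ, which is one way a query can silently fail to match (or match by accident).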



-Hoss



Re: default AND operator

dr fence
That question was badly worded.  I was trying to ask that when I write an
index using the StandardAnalyzer, the docs are transformed using that
analyzer then written to the index post transformation. So stop words or
things like apostrophes would be removed.

"Scott's Lawn and Garden Care"     becomes    "Scott Lawn Garden Care"

It just seems that my index written using the StandardAnalyzer still has
things like apostrophes and also things like the & symbol.


Re: default AND operator

Chris Hostetter-3
: index using the StandardAnalyzer, the docs are transformed using that
: analyzer then written to the index post transformation. So stop words or
: things like apostrophes would be removed.

if the analyzer used behaves that way, then yes -- the indexed terms will
remove those things.

: "Scott's Lawn and Garden Care"     becomes    "Scott Lawn Garden Care"
:
: It just seems that my index written using the StandardAnalyzer still has
: things like apostrophes and also things like the & symbol.

1) maybe you didn't really use StandardAnalyzer when the index was built?
2) keep in mind there is a difference between the indexed terms (matched
when doing queries) and the stored values of fields which are
displayed when you look at docs -- the stored values are never affected by
the analyzer.  when you say you still see apostrophes in your index, are
you looking at the stored values, or are you looking at the indexed Terms?
.. as i recall Luke lets you see both.
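
The stored-value versus indexed-term distinction can be sketched with a toy model. This is an illustration under stated assumptions only: ToyField, its stop-word list, and its analyze method are hypothetical and merely imitate the kind of transformation described in this thread (lower-casing, dropping a trailing 's, dropping stop words); it is not StandardAnalyzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical toy model of one field holding both a stored value and indexed terms.
final class ToyField {
    // Assumed stop-word list for this illustration only.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("and", "or", "the", "a", "of"));

    final String storedValue;        // returned verbatim when a document is displayed
    final List<String> indexedTerms; // what queries are actually matched against

    ToyField(String text) {
        this.storedValue = text;           // never touched by analysis
        this.indexedTerms = analyze(text); // analysis output only
    }

    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        // Split on runs of characters that are not letters, digits or apostrophes.
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9']+")) {
            // Drop a possessive 's, then any remaining apostrophes.
            String term = raw.endsWith("'s") ? raw.substring(0, raw.length() - 2) : raw;
            term = term.replace("'", "");
            if (!term.isEmpty() && !STOP_WORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }
}
```

Reading a document back shows storedValue, apostrophes and ampersands included, while a query can only ever hit entries in indexedTerms.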



-Hoss



Re: default AND operator

Erick Erickson
In reply to this post by dr fence
You probably want to take a closer look at the StandardAnalyzer. It uses
StandardTokenizer and StandardFilter. From the javadoc:

<<<<<StandardTokenizer

A grammar-based tokenizer constructed with JavaCC.

This should be a good tokenizer for most European-language documents:


   - Splits words at punctuation characters, removing punctuation.
   However, a dot that's not followed by whitespace is considered part of a
   token.
   - Splits words at hyphens, unless there's a number in the token, in
   which case the whole token is interpreted as a product number and is not
   split.
   - Recognizes email addresses and internet hostnames as one token.


Many applications have specific tokenizer needs. If this tokenizer does not
suit your application, please consider copying this source code directory to
your project and maintaining your own grammar-based tokenizer.
>>>>

When I first started with Lucene, I was surprised that StandardAnalyzer did
the tricks it does. I quickly found that, especially when starting out, I
got more intuitive results by using one of the simpler analyzers,
WhitespaceAnalyzer, StopAnalyzer or SimpleAnalyzer.

And one of the coolest analyzers is PatternAnalyzer down in
org.apache.lucene.index.memory.PatternAnalyzer

which uses a regular expression to tokenize streams. But do note if you use
this that the regex recognizes tokens to *break* on, not what constitutes a
token....
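
The break-on versus token distinction can be shown with plain java.util.regex, without PatternAnalyzer itself. This is a minimal sketch: ToyPatternTokenizer and its two methods are hypothetical helpers written for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper contrasting two ways a regex can drive tokenization.
final class ToyPatternTokenizer {
    // The pattern describes the separators; tokens are whatever lies between matches.
    static List<String> splitOn(Pattern delimiter, String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : delimiter.split(text)) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // The pattern describes the tokens themselves; each match is one token.
    static List<String> matchTokens(Pattern token, String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = token.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }
}
```

With the text "French & Associates", splitting on "\\W+" and matching "\\w+" both give the two words, but passing the token pattern "\\w+" to the break-on variant leaves only the " & " between the words, which is the mistake the warning above is about.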

Best
Erick


Re: default AND operator

dr fence
In reply to this post by Chris Hostetter-3
Truly I am new to Lucene.  That's the missing part ... I'm looking at the
stored values and not the indexed terms.

Mark
