Questions regarding Lucene query syntax

Daniel Einspanjer
The query syntax reference page talks about the NOT and the - operators, but
it wasn't clear to me what exactly the difference is between them.  Could
someone tell me briefly what that difference might be or point me at some
further docs that describe it?


Is there a way to require a portion of a query only if there are values for
that field in the document?
e.g. If I know that I only want to match movies made between 1973 and 1975,
I would like to be able to say in my query that if the document has a year,
it must be in that range, but if the document has no year at all, don't fail
the document for that reason alone.
This is also important in the director name part.  If a document has a
director given, and it doesn't match what I'm searching for, that should be
a fail, but if the document has no director field, I don't want to fail the
document for that reason alone.

Thanks,

Daniel

Re: Questions regarding Lucene query syntax

Erick Erickson
See below...

On 5/5/07, Daniel Einspanjer <[hidden email]> wrote:
>
> The query syntax reference page talks about the NOT and the - operators, but
> it wasn't clear to me what exactly the difference is between them.  Could
> someone tell me briefly what that difference might be or point me at some
> further docs that describe it?


See the thread "Standard Parser Behavior". It has several explications
of what the Lucene query syntax is all about. This confuses everybody,
so I think that thread will help you a lot.

Also, see http://wiki.apache.org/lucene-java/BooleanQuerySyntax

> Is there a way to require a portion of a query only if there are values for
> that field in the document?
> e.g. If I know that I only want to match movies made between 1973 and 1975,
> I would like to be able to say in my query that if the document has a year,
> it must be in that range, but if the document has no year at all, don't fail
> the document for that reason alone.
> This is also important in the director name part.  If a document has a
> director given, and it doesn't match what I'm searching for, that should be
> a fail, but if the document has no director field, I don't want to fail the
> document for that reason alone.


You'll have to include a dummy value I think. Remember that you're
searching for stuff with Lucene, so saying "match even if there's
nothing there" is, er, ABnormal..

I'd think about putting a dummy value in those fields you want to handle
this way. For instance, add "matchall" to documents with no date. Then
you'd need to add an 'or date:matchall' clause to all the dates you query
on. Make sure it's a value that behaves reasonably when you want to
include all dates, or all dates before ####, or all dates after ####.
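
A minimal sketch of the dummy-value idea, assuming the sentinel is written at index time and the query string is assembled in application code. The helper names, the field names, and the sample title and director are hypothetical, chosen only for illustration. The sketch builds query strings and does not call Lucene itself. Note that the classic query parser expects boolean operators such as OR in upper case.

```python
# Hypothetical helpers: names and the "matchall" sentinel are illustrative only.
SENTINEL = "matchall"


def field_value_for_index(value, sentinel=SENTINEL):
    """Value to index: the real value, or the sentinel when the field is missing."""
    return sentinel if value in (None, "") else str(value)


def optional_range(field, low, high, sentinel=SENTINEL):
    """Range clause that also accepts documents indexed with the sentinel."""
    return f"({field}:[{low} TO {high}] OR {field}:{sentinel})"


def optional_term(field, value, sentinel=SENTINEL):
    """Phrase clause that also accepts documents indexed with the sentinel."""
    return f'({field}:"{value}" OR {field}:{sentinel})'


# Assemble a query where year and director only constrain docs that have them.
query = " AND ".join([
    "title:jaws",                                   # sample data, not from the thread
    optional_range("year", 1973, 1975),
    optional_term("director", "Steven Spielberg"),  # sample data, not from the thread
])
print(query)
```

Documents indexed with the sentinel satisfy the OR branch, so a missing year or director no longer fails the match, while a present but wrong value still does.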

Best
Erick

> Thanks,
>
> Daniel
>

Re: Questions regarding Lucene query syntax

Daniel Einspanjer
On 5/6/07, Erick Erickson <[hidden email]> wrote:

>
> On 5/5/07, Daniel Einspanjer <[hidden email]> wrote:
> >
> > The query syntax reference page talks about the NOT and the - operators, but
> > it wasn't clear to me what exactly the difference is between them.  Could
> > someone tell me briefly what that difference might be or point me at some
> > further docs that describe it?
>
> See the thread "Standard Parser Behavior". It has several explications
> of what the Lucene query syntax is all about. This confuses everybody,
> so I think that thread will help you a lot.
>
> Also, see http://wiki.apache.org/lucene-java/BooleanQuerySyntax



I'll look for that thread right now, and check whether I've already read
that wiki page.

> > Is there a way to require a portion of a query only if there are values for
> > that field in the document?
> > e.g. If I know that I only want to match movies made between 1973 and 1975,
> > I would like to be able to say in my query that if the document has a year,
> > it must be in that range, but if the document has no year at all, don't fail
> > the document for that reason alone.
> > This is also important in the director name part.  If a document has a
> > director given, and it doesn't match what I'm searching for, that should be
> > a fail, but if the document has no director field, I don't want to fail the
> > document for that reason alone.
>
>
> You'll have to include a dummy value I think. Remember that you're
> searching for stuff with Lucene, so saying "match even if there's
> nothing there" is, er, ABnormal..
>
> I'd think about putting a dummy value in those fields you want to handle
> this way. For instance, add "matchall" to documents with no date. Then
> you'd need to add an 'or date:matchall' clause to all the dates you query
> on. Make sure it's a value that behaves reasonably when you want to
> include all dates, or all dates before ####, or all dates after ####.
>

Hrm.  I'll keep this idea on the cheat sheet for now. It turns out that
having a required date was causing too many mismatches for me.  Some of the
source feeds I'm matching have wildly inaccurate year fields, and when I
required that field, it would pull out some other poorly related item based
on the year and director, ignoring the right one because the year was bad.


By far the thing that is killing me the most is my trouble with trying to
provide users with scores that make sense from one item to the other.  I
tried out the SweetSpotSimilarity contrib, and I *think* it might have
helped the matching in general some, but it doesn't really give me a linear
range of scores that can be used for comparisons.  I keep scouring the web
looking for something that might explain enough tf and idf and norms in
terms that I could understand, but sadly, it just seems to be a bit over my
head right now. :/ Maybe I've just been fighting with this project for so
long my brain has turned to mush.

If I could find a way to make the scores for the queries I've mentioned in
this thread and others fall on a simple linear scale based on the number of
terms matched (ideally still affected by ^boosts), I think I'd be all set.
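
One way to picture the linear scale described above is the following sketch. It is not Lucene's scoring formula and uses no Lucene API. It assumes the application can tell which query clauses matched a document, and the clause names and boost values are hypothetical.

```python
def linear_score(matched_clauses, boosts):
    """Share of the total boost weight carried by the clauses that matched.

    Returns a value between 0.0 and 1.0 that grows linearly with each
    additional matched clause, weighted by that clause's boost.
    """
    total = sum(boosts.values())
    if total == 0:
        return 0.0
    matched_weight = sum(boosts[c] for c in matched_clauses if c in boosts)
    return matched_weight / total


# Hypothetical clauses and boosts, standing in for title^2 director year.
boosts = {"title": 2.0, "director": 1.0, "year": 1.0}

print(linear_score({"title", "year"}, boosts))              # 0.75
print(linear_score({"title", "director", "year"}, boosts))  # 1.0
```

Because the result is a fraction of the total boost weight, scores from different queries land on the same 0 to 1 scale and can be compared directly.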

Re: Questions regarding Lucene query syntax

Doron Cohen
> > > Is there a way to require a portion of a query only if there are values for
> > > that field in the document?
> > > e.g. If I know that I only want to match movies made between 1973 and 1975,
> > > I would like to be able to say in my query that if the document has a year,
> > > it must be in that range, but if the document has no year at all, don't fail
> > > the document for that reason alone.
> > > This is also important in the director name part.  If a document has a
> > > director given, and it doesn't match what I'm searching for, that should be
> > > a fail, but if the document has no director field, I don't want to fail the
> > > document for that reason alone.
> >
> >
> > You'll have to include a dummy value I think. Remember that you're
> > searching for stuff with Lucene, so saying "match even if there's
> > nothing there" is, er, ABnormal..
> >
> > I'd think about putting a dummy value in those fields you want to handle
> > this way. For instance, add "matchall" to documents with no date. Then
> > you'd need to add an 'or date:matchall' clause to all the dates you query
> > on. Make sure it's a value that behaves reasonably when you want to
> > include all dates, or all dates before ####, or all dates after ####.
> >
>
> Hrm.  I'll keep this idea on the cheat sheet for now. It turns out that

Just to note, in case you do want this: indexing a matchall word (as
Erick suggested) would be more efficient, but if it is too late for
that (the index already exists, etc.), it is still possible to phrase
a query that applies a range filter only to docs that contain the
ranged field.

With a query parser set to allowLeadingWildcard, this should do:

( +item -price:* ) ( +item +price:[0100 TO 0150] )

or, to avoid the too-many-clauses risk:

( +item -price:[MIN TO MAX]) ( +item +price:[0100 TO 0150] )

where MIN and MAX cover (at least) the full range of the ranged field.
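
The logic of the two queries above can be sketched in plain Python over a small in-memory list of documents. This only models the boolean structure (term present, and either no price field or price inside the range). It does not call Lucene, and the documents and helper names are hypothetical. Prices are zero-padded strings, as in the 0100 and 0150 bounds above, because term ranges compare text rather than numbers.

```python
def zero_pad(number, width=4):
    """Pad a number to fixed width so string order agrees with numeric order."""
    return str(number).zfill(width)


def matches(doc, term, low, high, field="price"):
    """Model of: ( +term -field:* ) ( +term +field:[low TO high] )."""
    has_term = term in doc["text"].split()
    value = doc.get(field)
    no_field = value is None                               # first group
    in_range = value is not None and low <= value <= high  # second group
    return has_term and (no_field or in_range)


# Hypothetical documents for illustration.
docs = [
    {"id": 1, "text": "item red", "price": zero_pad(120)},   # price in range
    {"id": 2, "text": "item blue", "price": zero_pad(900)},  # price out of range
    {"id": 3, "text": "item green"},                         # no price field
    {"id": 4, "text": "other", "price": zero_pad(120)},      # term missing
]

hits = [d["id"] for d in docs if matches(d, "item", "0100", "0150")]
print(hits)  # [1, 3]

# Why padding matters: as plain strings, "1000" sorts between "100" and "150".
print("100" <= "1000" <= "150")     # True, a false match without padding
print("0100" <= "1000" <= "0150")   # False, correct once the bounds are padded
```

Document 3 matches through the first group (no price at all), document 1 through the second (price in range), and document 2 fails because it has a price that falls outside the range.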

Doron


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Questions regarding Lucene query syntax

Daniel Einspanjer
On 5/7/07, Doron Cohen <[hidden email]> wrote:
> With a query parser set to allowLeadingWildcard, this should do:
> ( +item -price:* ) ( +item +price:[0100 TO 0150] )
> or, to avoid too-many-clauses risk:
> ( +item -price:[MIN TO MAX]) ( +item +price:[0100 TO 0150] )
> where MIN and MAX cover (at least) the full range of the ranged field.

Nice! This tip will be a handy one to have. Thanks.
