Test new query parser?

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

Test new query parser?

Mark Miller-3
Is anyone interested in helping me test out a new query parser (i.e is
anyone interested in using this, thereby helping me test it) ?

 The parser uses a intermediate parse tree representation, unlike Lucene's
Query Filter.


The syntax:

date[april 6, 1992] & field2,field3[parrot ~3s yore] | ((cat | horse) &
rabbit ~6 pete)

'&' is an 'and'


'|' is an 'or'


~2 is a 'within 2 words'


~5p is a 'within 5 paragraphs'


~6s is a 'within 6 sentences'


'!' is 'butnot'


' ' is an 'and' that binds tighter than anything else but can only connect
search tokens (i.e 'old man' not 'date[today] man'--that would be
date[today] & man...changeable of course:_)


field1, field2, field3[mark | butternut] performs a search on all 3 fields


date[20050601 to 20060504] uses constant scoring range query filter to date
search (can parse most date formats).


no arbitrary range search yet...

paragraph/sentence support requires a replacement (supplied) standard
analyzer and is optional.
there is also support for a "did you mean" feature that spits back the query
substituted with a did you mean guess based on the words in a supplied index
(the corpus or a dictionary maybe):


field1,field2[horke | tomcat] might return a suggested search of =
field1,field2[horse | tomcat]

I am about to add quoted searches as well as the ability to escape query
keywords.

more feature to come, there is a lot I want to change and add (and fix)...

if you are interested, drop me a line at [hidden email]

- mark

Some sample proximity queries:

        example = "monkey fowl ~3 man ~5 horse head ~4 lamb";
        expected = "+(+spanNear([allFields:monkey, allFields:man], 3, false)
+spanNear([allFields:fowl, allFields:man], 3, false))
+(+spanNear([allFields:man, allFields:horse], 5, false)
+spanNear([allFields:man, allFields:head], 5, false)
+(+spanNear([allFields:horse, allFields:lamb], 4, false)
+spanNear([allFields:head, allFields:lamb], 4, false)))";
        assertEquals(expected, parse(example));
//
        example = "monkey ~3 man";
        expected = "spanNear([allFields:monkey, allFields:man], 3, false)";
        assertEquals(expected, parse(example));

        example = "monkey ord~3 man";
        expected = "spanNear([allFields:monkey, allFields:man], 3, true)";
        assertEquals(expected, parse(example));

        example = "monkey ~3 man ~2 her";
        expected = "+spanNear([allFields:monkey, allFields:man], 3, false)
+spanNear([allFields:man, allFields:her], 2, false)";
        assertEquals(expected, parse(example));

        example = "(fowl & helicopter) ~8 hillary";
        expected = "+spanNear([allFields:fowl, allFields:hillary], 8, false)
+spanNear([allFields:helicopter, allFields:hillary], 8, false)";
        assertEquals(expected, parse(example));

        example = "(fowl | helicopter) ~6 hillary";
        expected = "+spanNear([allFields:fowl, allFields:hillary], 8, false)
+spanNear([allFields:helicopter, allFields:hillary], 8, false)";
        assertEquals(expected, parse(example));
//
//        // butnot resolves before proximity search
//        example = "(cop | fowl) & (fowl & priest man) ! helicopter ~8
hillary";
//        expected = "+(allFields:cop allFields:fowl)
+(+spanNear([allFields:fowl, allFields:hillary], 8, false)
+spanNear([allFields:priest, allFields:hillary], 8, false)
+spanNear([allFields:man, allFields:hillary], 8, false)
-spanNear([allFields:helicopter, allFields:hillary], 8, false))";
//        assertEquals(expected, parse(example));
//
        example = "priest man ! helicopter ~8 hillary";
        expected = "+spanNear([allFields:priest, allFields:hillary], 8,
false) +spanNear([allFields:man, allFields:hillary], 8, false)
-spanNear([allFields:helicopter, allFields:hillary], 8, false)";
        assertEquals(expected, parse(example));
Reply | Threaded
Open this post in threaded view
|

Re: Test new query parser?

Erik Hatcher

On Aug 21, 2006, at 3:39 PM, Mark Miller wrote:
> Is anyone interested in helping me test out a new query parser (i.e is
> anyone interested in using this, thereby helping me test it) ?

I'm definitely interested in giving it a try.  The syntax looks nice.

> ~5p is a 'within 5 paragraphs'
>
> ~6s is a 'within 6 sentences'

What defines paragraphs and sentences?   What types of queries do  
those resolve to?

Thanks,
        Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Test new query parser?

Mark Miller-3
Great, I will get something ready to be given out within a day or so then.

Paragraph/Sent prox support is one thing I really need to test and improve.

The parapraph and sentence search uses a SpanWithinQuery. This is just a
SpanNotQuery that can span a specified number of times instead of not at
all.

mark ~3p dopey

SpanWithinQuery(spanNear([allFields:mark, allFields:dopey], 99999, false),
3, allFields:¶)

>
> mark ord~3p dopey

SpanWithinQuery(spanNear([allFields:mark, allFields:dopey], 99999, true), 3,
allFields:¶)

I use 99999 terribly arbitrarily. Integer.max_value blows things up.

This example uses ¶ as a marker. I think a better default might be a double
newline of some kind. Unfortunately, the marker is not nicely configurable
without source code access because it must be defined in the .jj file of the
modified standard analyzer. Sentences can be deduced more universally:

Sentences are matched with:

<SENTENCE: ([".","?","!","]","[","]","\"","'",")"])+ ([" ","\t","\r","\n"])+
(["[","\"","'","`","("])* >

which is seems to be the standard sentence finder, although it lacks the
ability to check that an alphanumeric comes next (so that it doesn't mark a
sentence at the end of the text). No biggie for now though.

Also, right now paragraphs and sentences are marked separately at all
spots...for compactness a paragraph marker should also represent a sentence
marker. I need to check speed on these things though.

With some others interested in this I will get a move on these issues
though. I really want some feedback to make this thing better. It is not
perfect but I believe it has potential. Adding the other span type query for
example, would not be very difficult. The syntax is pretty easily modified
and extended

- Mark
Reply | Threaded
Open this post in threaded view
|

Re: Test new query parser?

Robert Watkins-2
In reply to this post by Mark Miller-3
Mark --

Yes please! I'm very interested in the mixing of boolean and proximity
operators. I have also worked on a parser (using JavaCC) but haven't
managed to crack queries such as:

     ((a OR b) AND c) NEAR (d NOT e)

I can get the parse tree okay, but haven't figured out how to translate
that into a valid Lucene Query object. Simple queries such as:

     (a OR b) NEXT (c OR d)   // note the use of OR exclusively!

are okay, but nothing more complex. So: bring it on!

-- Robert
rwatkins at foo-bar.org

On Mon, 21 Aug 2006, Mark Miller wrote:

> Is anyone interested in helping me test out a new query parser (i.e is
> anyone interested in using this, thereby helping me test it) ?
>
> The parser uses a intermediate parse tree representation, unlike Lucene's
> Query Filter.
>
> [ snipped ]

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Test new query parser?

Mark Miller-3
You are not kidding. I keep finding this tougher and tougher. Originally
my method worked for most simple queries, but not all. I improved it to
cover some more ground, allowing some modestly complex queries, but now
even that improvement seems woefully inadequate for solving the general
case. I was handling some pretty complex queries, but losing order of
operations in a proximity query if things got too exciting. Among other
small bugs.

The parse tree handles the boolean stuff fine on its own, but the
proximity seems to require a distributed attack (think algebra) that
still maintains order of operations. It is a bit nasty.

consider:
cop | fowl & (fowl | priest & man) ! helicopter ~8 (hillary | tom)

this must distribute to (roughly)
cop | ( (fowl & ( (fowl ~8 hillary | (priest ~8 hillary & man ~8
hillary) ) ! helicopter ~8 hillary) | (fowl ~8 tom | (priest ~8 tom &
man ~8 tom) ) ! helicopter ~8 tom))

I have gotten close but instead of (fowl | (priest & man)) I might get
((fowl | priest) & man)...a naive distribution will ignore order of ops
and order by left to right.

my order of ops:
& "and"
| "or"
~ "within"
! "butnot"
<space between words>

I have a new plan of attack that I have begun, but who knows where it
will lead. I thought I was so close, but apparently just a tease...that
method could only take me so far. I hate to put so much work into this
since I doubt anyone will even use such complex queries (the queries I
monitor are always so basic) but I may give it a go just to see if my
new idea will solve the general case.

We will see if this parser actually has any life in it. Maybe I am no
closer than you where-- I am very new at this.

- Mark

> Mark --
>
> Yes please! I'm very interested in the mixing of boolean and proximity
> operators. I have also worked on a parser (using JavaCC) but haven't
> managed to crack queries such as:
>
>     ((a OR b) AND c) NEAR (d NOT e)
>
> I can get the parse tree okay, but haven't figured out how to translate
> that into a valid Lucene Query object. Simple queries such as:
>
>     (a OR b) NEXT (c OR d)   // note the use of OR exclusively!
>
> are okay, but nothing more complex. So: bring it on!
>
> -- Robert
> rwatkins at foo-bar.org
>
> On Mon, 21 Aug 2006, Mark Miller wrote:
>
>> Is anyone interested in helping me test out a new query parser (i.e is
>> anyone interested in using this, thereby helping me test it) ?
>>
>> The parser uses a intermediate parse tree representation, unlike
>> Lucene's
>> Query Filter.
>>
>> [ snipped ]
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Test new query parser?

Robert Watkins-2
Mark --

Don't lose hope! We are migrating from Verity to Lucene, and I know for
a fact that we will have to support the kind of complex queries you talk
about; maybe not /quite/ as complex as your magnificent:

> cop | fowl & (fowl | priest & man) ! helicopter ~8 (hillary | tom)

but certainly more complex than I have been able to solve.

We can also take heart in that Verity hasn't quite cracked it either. In
trying to see exactly what I needed to support I was doing some
experiments against known data and discovered that Verity has some
parsing bugs that do not reveal themselves easily. For example:

   title:  "get me out of here, please"
   queryA: title:((here OR there) NEAR/2 please)
   queryB: title:(there OR here) NEAR/2 please)

In theory both queries should find the example title, but only queryB
works with Verity, so something is wrong.

-- Robert

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Test new query parser?

Mark Miller-3
It is interesting to note that Lucene would also seem to suffer from
bugs when using spans if you only have a single document in the index.
At least with the NotSpanQuery, the spans could wrap around the document
from end to beginning. This would be unexpected but would also go away
if you add a second doc. This is a limitation (if you could call it
that) of the spans implementation and not a bug though (I'm speculating
on all of this).

Hopefully my new idea will sort this problem out. This is a non-work
project so I don't have much time for it, but it is much more
interesting than my work so I will probably keep plugging away on off hours.

The new idea is to collect the terms in a composite object that will
maintain order of operations as well as the tokens.

I have much to do on this composite class, but this query seems to
generate an acceptable composite:

(me | cop & her) ~2 (old & women) :

cdistrib(cdistrib(distrib(allFields:me)'
'cdistrib(distrib(allFields:cop)'+'distrib(allFields:her)))) ~3
cdistrib(distrib(allFields:old)'+'distrib(allFields:women))

Then I will need a function that will distribute one composite across
another, creating the proximity query (I would guess a boolean at its
root). I am confident I can make the composite class work, but I have
not investigated how hard it will be to do the distribution. All of the
information is there though, so I am assuming it won't be too difficult
(don't i wish).

- Mark

> Mark --
>
> Don't lose hope! We are migrating from Verity to Lucene, and I know for
> a fact that we will have to support the kind of complex queries you talk
> about; maybe not /quite/ as complex as your magnificent:
>
>> cop | fowl & (fowl | priest & man) ! helicopter ~8 (hillary | tom)
>
> but certainly more complex than I have been able to solve.
>
> We can also take heart in that Verity hasn't quite cracked it either. In
> trying to see exactly what I needed to support I was doing some
> experiments against known data and discovered that Verity has some
> parsing bugs that do not reveal themselves easily. For example:
>
>   title:  "get me out of here, please"
>   queryA: title:((here OR there) NEAR/2 please)
>   queryB: title:(there OR here) NEAR/2 please)
>
> In theory both queries should find the example title, but only queryB
> works with Verity, so something is wrong.
>
> -- Robert
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Test new query parser?

Mark Miller-3
In reply to this post by Robert Watkins-2
I have received a few inquires about my new query parser. I apologize
for making that announcement a little premature. My current
implementation only allows simple mixing of proximity queries with
boolean queries...complex mixing would result in an incorrect search. A
reply to my first email made me consider this more (I had done that part
a while ago) and I came to the conclusion that it was obviously
unacceptable to release the parser to anyone in this hobbled form. The
parser must support arbitrary mixing of boolean and proximity searchers.

I think I have cracked this. I would say I am at 90%  of the way to a
solution and can see the light at the end of the tunnel. When I have
resolved this issue, I will contact those that have expressed interest
and provide them with the parser. With some feedback and improvements I
will think about how to release it generally.

- Mark

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]