wildcard and span queries

wildcard and span queries

Erick Erickson
Well, we defined this problem away for one of our products, but it's back
for a different product. Siiiiigggghhhh......

I'm valiantly trying to get our product manager (hereinafter PM) to define
this problem away, perhaps allowing me to deal with this by clever indexing
and/or some variant on prefix query. But in case that doesn't fly, I'm
wondering what wisdom exists.

Fortunately, the PM agrees that it's silly to think about span queries
involving OR or NOT for this app. So I'm left with something like Jo*n AND
sm*th AND jon?es WITHIN 6.

The only approach that's occurred to me is to create a filter for the
terms, giving me a subset of my docs that have any terms satisfying the
above. For each doc in the filter, get creative with TermPositionVector for
determining whether the document matches. It seems that this would involve
creating a list of all positions in each doc in my filter that match jo*n,
another for sm*th, and another for jon?es and seeing if the distance
(however I define that) between any triple of terms (one from each list) is
less than 6.
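For concreteness, that brute-force position check can be sketched in plain Python (illustrative only, not Lucene code; the position lists and the distance rule are hypothetical):

```python
from itertools import product

def any_triple_within(position_lists, max_distance):
    """Brute-force check: is there one position from each list whose
    overall spread is under max_distance? Cost is the product of the
    list sizes, which is exactly the time-wise blow-up to worry about."""
    for picks in product(*position_lists):
        if max(picks) - min(picks) < max_distance:
            return True
    return False

# Positions where jo*n, sm*th and jon?es (hypothetically) matched:
print(any_triple_within([[3, 40], [5, 90], [7]], 6))  # True: 3, 5, 7
print(any_triple_within([[0], [10], [20]], 6))        # False
```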

My gut feel is that this explodes time-wise based upon the number of terms
that match. In this particular application, we are indexing 20K books. Based
on indexing 4K of them, this amounts to about a 4G index (although I
actually expect this to be somewhat larger since I haven't indexed all the
fields, just the text so far). I can't imagine that comparing the expanded
terms for, say, 10,000 docs will be fast. I'm putting together an experiment
to test this though.

But someone could save me a lot of work by telling me that this is solved
already. This is your chance <G>......

The expanding queries (e.g. PrefixQuery, RegexQuery, WildcardQuery) all blow
up with TooManyClauses, and I've tried upping the MaxClauses field but that
takes forever and *then* blows up. Even with -Xmx set as high as I can.

I know, I know. If I solve this, feel free to submit it to the contribution
section.....

Thanks
Erick

P.S. Apologies if this is a re-post. But every time I try to submit a new
request from home, I get an error like this....
************
Technical details of permanent failure:
PERM_FAILURE: SMTP Error (state 12): 550 SpamAssassin score
5.1(DNS_FROM_RFC_ABUSE,HTML_00_10,HTML_MESSAGE,RCVD_IN_BL_SPAMCOP_NET)
exceeds threshold 5.0.
***********

Which appears to be related to the fact that I have a Direcway satellite
connection at home. Anybody who's figured out how to cure this, please feel
free to e-mail me. I don't quite know whether this is even getting to the
user list server or is getting returned from the Direcway processing....

Re: wildcard and span queries

Paul Elschot
On Friday 06 October 2006 14:37, Erick Erickson wrote:
...
> Fortunately, the PM agrees that it's silly to think about span queries
> involving OR or NOT for this app. So I'm left with something like Jo*n AND
> sm*th AND jon?es WITHIN 6.

OR works much the same as term expansion for wildcards.

> The only approach that's occurred to me is to create a filter on for the
> terms, giving me a subset of my docs that have any terms satisfying the
> above. For each doc in the filter, get creative with TermPositionVector for
> determining whether the document matches. It seems that this would involve
> creating a list of all positions in each doc in my filter that match jo*n,
> another for sm*th, and another for jon?es and seeing if the distance
> (however I define that) between any triple of terms (one from each list) is
> less than 6.

> My gut feel is that this explodes time-wise based upon the number of terms
> that match. In this particular application, we are indexing 20K books. Based
> on indexing 4K of them, this amounts to about a 4G index (although I
> acutally expect this to be somewhat larger since I haven't indexed all the
> fields, just the text so far). I can't imagine that comparing the expanded
> terms for, say, 10,000 docs will be fast. I'm putting together an experiment
> to test this though.
>
> But someone could save me a lot of work by telling me that this is solved
> already. This is your chance <G>......

It's solved :) here:
http://svn.apache.org/viewvc/lucene/java/trunk/contrib/surround/

The surround query language uses only the spans package for
WITHIN-like queries, no filters.
You may not want to use the parser, but all the rest could be handy.
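Roughly speaking, span queries avoid the cross-product comparison by walking each term's positions in order and always advancing whichever stream is furthest behind. A toy Python sketch of that merge idea (illustrative, not the actual spans code):

```python
def within_by_merge(position_lists, max_distance):
    """Linear-time variant of the proximity check: keep one cursor per
    sorted position list and repeatedly advance the smallest position,
    since only the minimum can shrink the current window."""
    cursors = [0] * len(position_lists)
    while all(c < len(lst) for c, lst in zip(cursors, position_lists)):
        picks = [lst[c] for c, lst in zip(cursors, position_lists)]
        if max(picks) - min(picks) < max_distance:
            return True
        smallest = min(range(len(picks)), key=picks.__getitem__)
        cursors[smallest] += 1
    return False

print(within_by_merge([[3, 40], [5, 90], [7]], 6))  # True
```

The total work is linear in the number of stored positions rather than the product of the list sizes.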
 
> The expanding queries (e.g. PrefixQuery, RegexQuery, WildcardQuery) all blow
> up with TooManyClauses, and I've tried upping the MaxClauses field but that
> takes forever and *then* blows up. Even with -Xmx set as high as I can.

The surround language has its own limitation on the maximum number
of terms expanded for wildcards, and it works nicely even for rather
high numbers of terms (thousands) for WITHIN-like queries,
given enough RAM.

It shouldn't be too difficult to add NOT queries within WITHIN,
there already is a SpanNotQuery in Lucene to map onto.

Regards,
Paul Elschot

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: wildcard and span queries

Erick Erickson
Paul:

Splendid! Now if I just understood a single thing about the SrndQuery family
<G>.

I followed your link, and took a look at the text file. That should give me
enough to get started.

But if you wanted to e-mail me any sample code or long explanations of what
this all does, I would forever be your lackey <G>....

I should also fairly easily be able to run a few of these against the
partial index I already have to get some sense of how it'll all work out in
my problem space. I suspect that the actual number of distinct terms won't
grow too much after the first 4,000 books, so it'll probably be pretty safe
to get this running in the "worst case", find out if/where things blow up,
and put in some safeguards. Or perhaps discover that it's completely and
entirely perfect <G>.

Thanks again
Erick


Re: wildcard and span queries

Paul Elschot
Erick,

On Friday 06 October 2006 22:01, Erick Erickson wrote:

> Paul:
>
> Splendid! Now if I just understood a single thing about the SrndQuery family
> <G>.
>
> I followed your link, and took a look at the text file. That should give me
> enough to get started.
>
> But if you wanted to e-mail me any sample code or long explanations of what
> this all does, I would forever be your lackey <G>....

Correcting any bug in the surround source code would do...

> I should also fairly easily be able to run a few of these against the
> partial index I already have to get some sense of now it'll all work out in
> my problem space. I suspect that the actual number of distinct terms won't
> grow too much after the first 4,000 books, so it'll probably be pretty safe
> to get this running in the "worst case", find out if/where things blow up,
> and put in some safeguards. Or perhaps discover that it's completely and
> entirely perfect <G>.

Have a look at the surround test code, and then try and use the surround
parser as a replacement for the standard query parser.
That should be doable, and it will at least allow you to get some
(nested) proximity queries running against your own data.
The next step is to replace the parser with your own, assuming you
have one already.

I never got round to writing javadocs for it, and I realize that is
an obstacle. The problem is that writing good javadocs takes me
about as much time as writing the code in the first place.

There is a safeguard in the surround code; it's built into the
BasicQueryFactory class. It is used at the bottom of the surround
code to generate no more than a maximum number of Lucene
TermQuery and SpanTermQuery instances.
The top of the surround code is its parser.
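As a rough illustration of that safeguard (a toy Python counterpart, not the actual surround source; the names are borrowed for flavor):

```python
class TooManyBasicQueries(Exception):
    """Raised once a whole (possibly nested) query has minted more
    primitive term queries than the configured cap allows."""

class BasicQueryFactory:
    def __init__(self, max_basic_queries=1024):
        self.max_basic_queries = max_basic_queries
        self.count = 0

    def new_term_query(self, term):
        # Every expanded wildcard term passes through here, so the cap
        # applies to the complete query, however deeply it is nested.
        self.count += 1
        if self.count > self.max_basic_queries:
            raise TooManyBasicQueries(f"over {self.max_basic_queries}")
        return ("term", term)

factory = BasicQueryFactory(max_basic_queries=2)
factory.new_term_query("smith")
factory.new_term_query("smyth")
# A third expansion would raise TooManyBasicQueries.
```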

Regards,
Paul Elschot

 


Re: wildcard and span queries

Mark Miller-3
In reply to this post by Erick Erickson
Paul's parser is beyond my feeble comprehension...but I would start by
looking at SrndTruncQuery. It looks to me like this enumerates each
possible match just like a SpanRegexQuery does...I am too lazy to figure
out what the visitor pattern is doing so I don't know if they then get
added to a boolean query, but I don't know what else would happen. If
this is the case, I am wondering if it is any more efficient than the
SpanRegex implementation...which could be changed to a SpanWildcard
implementation. How exactly is this better at avoiding a TooManyClauses
exception or RAM fill-up? Is it just the fact that the (let's say) three
wildcard terms are ANDed, so this should dramatically reduce the matches?
I don't want to sound any stupider, so I will stop there--hopefully Paul
will expound on this.

- Mark


Re: wildcard and span queries

Paul Elschot
Mark,

On Friday 06 October 2006 22:46, Mark Miller wrote:
> Paul's parser is beyond my feeble comprehension...but I would start by
> looking at SrndTruncQuery. It looks to me like this enumerates each
> possible match just like a SpanRegexQuery does...I am too lazy to figure
> out what the visitor pattern is doing so I don't know if they then get
> added to a boolean query, but I don't know what else would happen. If

They can also be added to a SpanOrQuery as SpanTermQuery,
this depends on the context of the query (distance query or not).
The visitor pattern is used to have the same code for distance queries
and other queries as far as possible.

> this is the case, I am wondering if it is any more efficient than the
> SpanRegex implementation...which could be changed to a SpanWildcard

I don't think the surround implementation of expanding terms is more
efficient than the Lucene implementation.
Surround does have the functionality of a SpanWildCard, but
the implementation of the expansion is shared, see above.

> implementation. How exactly is this better at avoiding a toomanyclauses
> exception or ram fillup. Is it just the fact that the (lets say) three
> wildcard terms are anded so this should dramatically reduce the matches?

The limitation in BasicQueryFactory works for a complete surround query,
which can be nested.
In Lucene, only the maximum number of clauses for a single-level
BooleanQuery can be controlled.

 >...

Regards,
Paul Elschot

 


Re: wildcard and span queries

Erick Erickson
OK, I'm using the surround code, and it seems to be working...with the
following questions (always, more questions)...

> I'm getting an exception sometimes of TooManyBasicQueries. I can control
this by initializing BasicQueryFactory with a larger number. Do you have any
cautions about upping this number?

> There's a hard-coded value minimumPrefixLength, set to 3, down in the
surround query parser code (allowedSuffix). I see no method to change this. I
assume that this is to prevent using up too much memory/time. What should I
know about this value? I'm mostly interested in a justification for the
product manager why allowing, say, two character (or one character) prefixes
is a bad idea <G>.

> I'm a bit confused. It appears that TooManyBooleanClauses is orthogonal to
Surround queries. That is, trying SpanRegexQuery doesn't want to work at all
with the same search clause, as it runs out of memory pretty quickly......

However, working with three-letter prefixes is blazingly fast.........
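On the minimum prefix length: expanding a prefix over a sorted term dictionary returns every term in a contiguous range, and that range grows quickly as the prefix shrinks. A small illustration in Python (the vocabulary and function are toy examples; minimumPrefixLength itself is from the surround parser as described above):

```python
import bisect

def expand_prefix(sorted_vocab, prefix):
    """All dictionary terms starting with prefix, found via two binary
    searches into the sorted term list. Shorter prefixes cover wider
    ranges, hence the guard against one- or two-character prefixes."""
    lo = bisect.bisect_left(sorted_vocab, prefix)
    hi = bisect.bisect_left(sorted_vocab, prefix + "\uffff")
    return sorted_vocab[lo:hi]

vocab = sorted(["jon", "jones", "john", "johnson", "joan", "smith", "smyth"])
print(expand_prefix(vocab, "jon"))       # ['jon', 'jones']
print(len(expand_prefix(vocab, "jo")))   # 5
```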

Thanks again...

Erick


Re: wildcard and span queries

Erick Erickson
OK, forget the stuff about "TooManyBooleanClauses". I finally figured out
that if I specify the surround query to have the same semantics as a SpanRegex
(i.e., and(eri*, mal*)), it blows up with TooManyBooleanClauses. So that makes
more sense to me now.

Specifying 20w(eri*, mal*) is what I was using before.

Erick

On 10/9/06, Erick Erickson <[hidden email]> wrote:

>
> OK, I'm using the surround code, and it seems to be working...with the
> following questions (always, more questions)...
>
> > I'm getting an exception sometimes of TooManyBasicQueries. I can control
> this by initializing BasicQueryFactory with a larger number. Do you have any
> cautions about upping this number?
>
> > There's a hard-coded value minimumPrefixLength set to 3 down in the
> Surround query parser code (allowedSuffix). I see no method to change this. I
> assume that this is to prevent using up too much memory/time. What should I
> know about this value? I'm mostly interested in a justification for the
> product manager why allowing, say, two-character (or one-character) prefixes
> is a bad idea <G>.
>
> > I'm a bit confused. It appears that TooManyBooleanClauses is orthogonal
> to Surround queries. That is, trying RegexSpanQuery doesn't want to work at
> all with the same search clause, as it runs out of memory pretty
> quickly......
>
> However, working with three-letter prefixes is blazingly fast.........
>
> Thanks again...
>
> Erick
>
> On 10/6/06, Paul Elschot < [hidden email]> wrote:
> >
> > Mark,
> >
> > On Friday 06 October 2006 22:46, Mark Miller wrote:
> > > Paul's parser is beyond my feeble comprehension...but I would start by
> > > looking at SrndTruncQuery. It looks to me like this enumerates each
> > > possible match just like a SpanRegexQuery does...I am too lazy to figure
> > > out what the visitor pattern is doing so I don't know if they then get
> > > added to a boolean query, but I don't know what else would happen. If
> >
> > They can also be added to a SpanOrQuery as SpanTermQuery,
> > this depends on the context of the query (distance query or not).
> > The visitor pattern is used to have the same code for distance queries
> > and other queries as far as possible.
> >
> > > this is the case, I am wondering if it is any more efficient than the
> > > SpanRegex implementation...which could be changed to a SpanWildcard
> >
> > I don't think the surround implementation of expanding terms is more
> > efficient than the Lucene implementation.
> > Surround does have the functionality of a SpanWildCard, but
> > the implementation of the expansion is shared, see above.
> >
> > > implementation. How exactly is this better at avoiding a TooManyClauses
> > > exception or RAM fill-up? Is it just the fact that the (let's say) three
> > > wildcard terms are ANDed so this should dramatically reduce the matches?
> >
> > The limitation in BasicQueryFactory works for a complete surround query,
> > which can be nested.
> > In Lucene, only the max number of clauses for a single-level BooleanQuery
> > can be controlled.
> >
> > >...
> >
> > Regards,
> > Paul Elschot
> >
> >
> > > - Mark
> > >
> > > Erick Erickson wrote:
> > > > Paul:
> > > >
> > > > Splendid! Now if I just understood a single thing about the
> > SrndQuery
> > > > family
> > > > <G>.
> > > >
> > > > I followed your link, and took a look at the text file. That should
> > > > give me
> > > > enough to get started.
> > > >
> > > > But if you wanted to e-mail me any sample code or long explanations
> > of
> > > > what
> > > > this all does, I would forever be your lackey <G>....
> > > >
> > > > I should also fairly easily be able to run a few of these against the
> > > > partial index I already have to get some sense of how it'll all work out in
> > > > my problem space. I suspect that the actual number of distinct terms
> > > > won't
> > > > grow too much after the first 4,000 books, so it'll probably be
> > pretty
> > > > safe
> > > > to get this running in the "worst case", find out if/where things
> > blow
> > > > up,
> > > > and put in some safeguards. Or perhaps discover that it's completely
> > and
> > > > entirely perfect <G>.
> > > >
> > > > Thanks again
> > > > Erick
> > > >
> > > > On 10/6/06, Paul Elschot <[hidden email]> wrote:
> > > >>
> > > >> On Friday 06 October 2006 14:37, Erick Erickson wrote:
> > > >> ...
> > > >> > Fortunately, the PM agrees that it's silly to think about span
> > queries
> > > >> > involving OR or NOT for this app. So I'm left with something like
> > Jo*n
> > > >> AND
> > > >> > sm*th AND jon?es WITHIN 6.
> > > >>
> > > >> OR works much the same as term expansion for wildcards.
> > > >>
> > > >> > The only approach that's occurred to me is to create a filter on
> > > >> for the
> > > >> > terms, giving me a subset of my docs that have any terms
> > satisfying
> > > >> the
> > > >> > above. For each doc in the filter, get creative with
> > > >> TermPositionVector
> > > >> for
> > > >> > determining whether the document matches. It seems that this
> > would
> > > >> involve
> > > >> > creating a list of all positions in each doc in my filter that
> > match
> > > >> jo*n,
> > > >> > another for sm*th, and another for jon?es and seeing if the
> > distance
> > > >> > (however I define that) between any triple of terms (one from
> > each
> > > >> list)
> > > >> is
> > > >> > less than 6.
> > > >>
> > > >> > My gut feel is that this explodes time-wise based upon the number
> > of
> > > >> terms
> > > >> > that match. In this particular application, we are indexing 20K
> > books.
> > > >> Based
> > > >> > on indexing 4K of them, this amounts to about a 4G index
> > (although I
> > > >> > actually expect this to be somewhat larger since I haven't
> > indexed all
> > > >> the
> > > >> > fields, just the text so far). I can't imagine that comparing the
> > > >> expanded
> > > >> > terms for, say, 10,000 docs will be fast. I'm putting together an
> > > >> experiment
> > > >> > to test this though.
> > > >> >
> > > >> > But someone could save me a lot of work by telling me that this
> > is
> > > >> solved
> > > >> > already. This is your chance <G>......
> > > >>
> > > >> It's solved :) here:
> > > >> http://svn.apache.org/viewvc/lucene/java/trunk/contrib/surround/
> > > >>
> > > >> The surround query language uses only the spans package for
> > > >> WITHIN like queries, no filters.
> > > >> You may not want to use the parser, but all the rest could be
> > handy.
> > > >>
> > > >> > The expanding queries (e.g. PrefixQuery, RegexQuery,
> > WildcardQuery)
> > > >> all
> > > >> blow
> > > >> > up with TooManyClauses, and I've tried upping the MaxClauses
> > field but
> > > >> that
> > > >> > takes forever and *then* blows up. Even with -Xmx set as high as
> > I
> > > >> can.
> > > >>
> > > >> The surround language has its own limitation on the maximum number
> > > >> of terms expanded for wildcards, and it works nicely even for
> > rather
> > > >> high numbers of terms (thousands) for WITHIN like queries,
> > > >> given enough RAM.
> > > >>
> > > >> It shouldn't be too difficult to add NOT queries within WITHIN,
> > > >> there already is a SpanNotQuery in Lucene to map onto.
> > > >>
> > > >> Regards,
> > > >> Paul Elschot
> > > >>
> > > >>
> > ---------------------------------------------------------------------
> > > >> To unsubscribe, e-mail: [hidden email]
> > > >> For additional commands, e-mail: [hidden email]
> > > >>
> > > >>
> > > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [hidden email]
> > > For additional commands, e-mail: [hidden email]
> > >
> > >
> > >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [hidden email]
> > For additional commands, e-mail: [hidden email]
> >
> >
>

Re: wildcard and span queries

Paul Elschot
Erick,

On Monday 09 October 2006 21:20, Erick Erickson wrote:

> OK, forget the stuff about "TooManyBooleanClauses". I finally figured out
> that if I specify the surround to have the same semantics as a SpanRegex
> (i.e., and(eri*, mal*)), it blows up with TooManyBooleanClauses. So that makes
> more sense to me now.
>
> Specifying 20w(eri*, mal*) is what I was using before.
>
> Erick
>
> On 10/9/06, Erick Erickson <[hidden email]> wrote:
> >
> > OK, I'm using the surround code, and it seems to be working...with the
> > following questions (always, more questions)...
> >
> > > I'm getting an exception sometimes of TooManyBasicQueries. I can control
> > this by initializing BasicQueryFactory with a larger number. Do you have
> > any cautions about upping this number?
> >
> > > There's a hard-coded value minimumPrefixLength set to 3 down in the
> > Surround query parser code (allowedSuffix). I see no method to change this.
> > I assume that this is to prevent using up too much memory/time. What should
> > I know about this value? I'm mostly interested in a justification for the
> > product manager why allowing, say, two-character (or one-character)
> > prefixes is a bad idea <G>.

Once BasicQueryFactory has a satisfactory limitation, that is, one that
a user can understand when the exception for too many basic queries
is thrown, there is no need to keep this minimum prefix length at 3;
1 or 2 will also do. When using many thousands as the max. basic queries,
the term expansion itself might take some time to reach that maximum.

You might want to ask the PM for a reasonable query involving such short
prefixes, though. In most western languages, they do not make much sense.
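Paul's distinction, one shared budget for the whole nested query versus Lucene's per-BooleanQuery clause cap, can be sketched like this (illustrative Python; the class and method names mirror surround's BasicQueryFactory but this is not the real implementation):

```python
class BasicQueryFactory:
    """One shared budget of basic (term-level) queries for an entire
    nested query tree -- unlike BooleanQuery.maxClauseCount, which only
    limits one boolean level at a time."""
    def __init__(self, max_basic_queries):
        self.max_basic_queries = max_basic_queries
        self.count = 0

    def new_term_query(self, term):
        self.count += 1
        if self.count > self.max_basic_queries:
            raise RuntimeError("TooManyBasicQueries")
        return ("term", term)

def expand_tree(tree, factory):
    """Walk a nested (op, children) tree, charging every expanded term
    against the single shared factory budget."""
    op, children = tree
    out = []
    for child in children:
        if isinstance(child, str):
            out.append(factory.new_term_query(child))
        else:
            out.append(expand_tree(child, factory))
    return (op, out)

# Nested query: 20d( 4w(a, b), 5d(c, d) ) -- four basic queries in total,
# counted across both nesting levels.
tree = ("20d", [("4w", ["a", "b"]), ("5d", ["c", "d"])])
expand_tree(tree, BasicQueryFactory(10))   # fine: 4 <= 10
```

Because the budget is global, a deeply nested query cannot dodge the limit the way nested BooleanQueries can dodge a per-level clause cap.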

> >
> > > I'm a bit confused. It appears that TooManyBooleanClauses is orthogonal
> > to Surround queries. That is, trying RegexSpanQuery doesn't want to work
> > at all with the same search clause, as it runs out of memory pretty
> > quickly......
> >
> > However, working with three-letter prefixes is blazingly fast.........

Your index is probably not very large (yet). Make sure to reevaluate
the max. number of basic queries as it grows.

Did you try nesting like this:
20d( 4w(lucene, action), 5d(hatch*, gospod*))
?
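The Nw and Nd operators in that example can be read roughly as "ordered within N" and "unordered within N". A toy matcher over term position lists (my own simplified reading of the semantics, brute force, not the spans implementation):

```python
from itertools import product

def within(pos_lists, slop, ordered):
    """True if one occurrence of each term fits inside a window of `slop`
    word positions -- in the given order when `ordered` (roughly Nw),
    in any order otherwise (roughly Nd)."""
    for combo in product(*pos_lists):
        # Ordered mode requires strictly increasing positions.
        if ordered and any(b <= a for a, b in zip(combo, combo[1:])):
            continue
        if max(combo) - min(combo) < slop:
            return True
    return False

# "4w(lucene, action)": lucene at 10, action at 12 -> in order, distance 2.
print(within([[10], [12]], 4, ordered=True))    # True
# Reversed occurrences fail the ordered test but pass the unordered one.
print(within([[12], [10]], 4, ordered=True))    # False
print(within([[12], [10]], 4, ordered=False))   # True
```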

Could you tell a bit more about the target grammar?

Regards,
Paul Elschot



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: wildcard and span queries

Erick Erickson
I've already started that conversation with the PM, I'm just trying to get a
better idea of what's possible. I'll whimper tooth and nail to keep from
having to do a lot of work to add a feature to a product that nobody in
their right mind would ever use <G>.

As far as the grammar, we don't actually have one yet. That's part of what
this exploration is all about. The kicker is that what we are indexing is
OCR data, some of which is pretty trashy. So you wind up with "interesting"
words in your index, things like rtyHrS. So the whole question of allowing
very specific queries on detailed wildcards (combined with spans) is under
discussion. It's not at all clear to me that there's any value to the end
users in the capability of, say, two character prefixes. And, it's an easy
rule that "prefix queries must specify at least 3 non-wildcard
characters"....

Thanks for your advice. You're quite correct that the index isn't very large
yet. My task tonight is to index about 4K books. I suspect that the number
of terms won't increase dramatically after that many books, but that's an
assumption on my part.

Thanks again
Erick
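The "prefix queries must specify at least 3 non-wildcard characters" rule is easy to enforce before a query ever reaches the parser. A hypothetical validator (the function name and exact rule are made up for illustration, not Lucene API):

```python
import re

def prefix_is_long_enough(clause, min_prefix=3):
    """Reject wildcard clauses whose first '*' or '?' appears before
    `min_prefix` literal characters (hypothetical helper)."""
    m = re.search(r"[*?]", clause)
    if m is None:
        return True          # no wildcard at all: nothing to limit
    return m.start() >= min_prefix

print(prefix_is_long_enough("jon?es"))  # True: three literals before '?'
print(prefix_is_long_enough("sm*th"))   # False: only two
```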


Re: wildcard and span queries

Doron Cohen
"Erick Erickson" <[hidden email]> wrote on 09/10/2006 13:09:21:
> ... The kicker is that what we are indexing is
> OCR data, some of which is pretty trashy. So you wind up with "interesting"
> words in your index, things like rtyHrS. So the whole question of allowing
> very specific queries on detailed wildcards (combined with spans) is under
> discussion. It's not at all clear to me that there's any value to the end
> users in the capability of, say, two character prefixes. And, it's an easy
> rule that "prefix queries must specify at least 3 non-wildcard
> characters"....

Erick, I may be off course here, but, fwiw, have you considered n-gram
indexing/search for a degree of fuzziness to compensate for OCR errors..?

For a four-word query you would probably get ~20 tokens (bigrams?) - no
matter what the index size is. You would then probably want to score higher
by LA (lexical affinity - query terms appearing close to each other in the
document) - and I am not sure to what degree a span query (made of n-gram
terms) would serve that, because (1) all terms in the span need to be there
(well, I think :-); and (2) you would like to increase doc score for
close-by terms only for close-by query n-grams.

So there might not be a ready-to-use solution in Lucene for this, but
perhaps this is a more robust direction to try than the wildcard approach
- I mean, if users want to type a wildcard query, it is their right to do
so, but for the application logic this does not seem the best choice.
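Doron's n-gram idea can be sketched with character bigrams and a simple overlap score: an OCR-garbled word still shares most of its grams with the intended word. (Illustrative only; Lucene's actual n-gram tokenizers and scoring differ in detail.)

```python
def ngrams(word, n=2):
    """Padded character n-grams, the usual n-gram tokenization of a word."""
    padded = f"_{word}_"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def similarity(a, b, n=2):
    """Jaccard overlap of the two words' n-gram sets."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb)

# An OCR error ('i' read as 'l') still leaves half the bigrams shared,
# while an unrelated word shares none.
print(similarity("smith", "smlth"))   # 0.5
print(similarity("smith", "jones"))   # 0.0
```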




Re: wildcard and span queries

Erick Erickson
Doron:

Thanks for the suggestion, I'll certainly put it on my list, depending upon
what the PM decides. This app is genealogy research, and users *can* put
in their own wildcards...

This is why I love this list... lots of smart people giving me suggestions I
never would have thought of <G>...

Thanks
Erick


Re: wildcard and span queries

Erick Erickson
Problem 3482:

I'm probably close to being able to start work. Except...

How to count hits with SrndQuery? Or, more generally, with arbitrary
wildcards and boolean operators?

So, say I've indexed a book by page. That is, each page is a document. I
know a particular page matches my query because the SrndQuery found it. Now,
I want to answer the question "How many times did the query match on this
page"?

For a Srnd query of, say, the form "20w(e?ci, ma?l, k?nd, m??ik?, k?i?f,
h?n???d, co?e, ca??l?, r????o?e, cl?p, ho???k?)". Imagine adding a not or
three, a nested pair of OR clauses and..... No, DON'T tell me that that
probably wouldn't match any pages anyway <G>....

Anyway, I want to answer "how many times does this occur on the page".
Another way of asking this, I suppose, is "how many terms would be
highlighted on that page", but I don't think highlighting helps. And I'm
aware that the question "how many times does 'this' occur" is ambiguous,
especially when we add the not case in......

I can think of a couple of approaches:
1> Get down and dirty with the terms. That is, examine the term position
vectors and compare all the nasty details of where they occur, combined
with, say, RegexTermEnum, and go at it. This is fairly ugly, especially with
nested queries. But I can do it, especially if we limit the complexity of
the query or define the hit count more simply.
2> Get clever with a regex: fetch the text of the page and see how many
times the regex matches. I'd imagine that the regex will
be...er...unpleasant.
2a> Use simpler regex expressions for each term, assemble the list of match
positions, and count.
2b> Isn't this really just using TermDocs as it was meant to be used,
combined with RegexTermEnum?
2c> Since the number of regex matches on a particular page is much smaller
than the number of regex matches over the entire index, does anyone have a
feel for whether <2a> or <2b> is easier/faster? For <2a>, I'm analyzing a
page with a regex. For <2b>, Lucene has already done the pattern matching,
but I'm reading a bunch of different TermDocs......
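Approach <2a> could be sketched roughly like this, in plain Java over an
already-tokenized page. Everything here is made up for illustration (the
class, the window definition, the wildcard-to-regex translation); none of it
is Lucene API, it's just the position arithmetic done by hand:

```java
import java.util.*;
import java.util.regex.*;

// Sketch of approach <2a>: per-pattern match positions, then a window check.
public class PageHitCounter {

    // Translate a simple wildcard pattern (? = one char, * = any run)
    // into a java.util.regex pattern matched against a whole token.
    static Pattern wildcardToRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?') sb.append("\\w");
            else if (c == '*') sb.append("\\w*");
            else sb.append(Pattern.quote(String.valueOf(c)));
        }
        return Pattern.compile(sb.toString());
    }

    // Token offsets on the page where the pattern matches.
    static List<Integer> matchPositions(String[] tokens, Pattern p) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++)
            if (p.matcher(tokens[i]).matches()) positions.add(i);
        return positions;
    }

    // Count windows of width <= window containing a match for every pattern;
    // one crude definition of "hits on this page". To avoid counting the
    // same co-occurrence many times, only windows anchored on a match for
    // the first pattern count.
    static int countHits(String[] tokens, String[] wildcards, int window) {
        List<List<Integer>> lists = new ArrayList<>();
        for (String w : wildcards)
            lists.add(matchPositions(tokens, wildcardToRegex(w)));
        int hits = 0;
        for (int start = 0; start < tokens.length; start++) {
            boolean all = true;
            for (List<Integer> l : lists) {
                boolean found = false;
                for (int pos : l)
                    if (pos >= start && pos < start + window) { found = true; break; }
                if (!found) { all = false; break; }
            }
            if (all && lists.get(0).contains(start)) hits++;
        }
        return hits;
    }

    public static void main(String[] args) {
        String[] page = "john meets smith and jones near john".split(" ");
        System.out.println(countHits(page, new String[]{"jo*n", "sm?th"}, 6));
    }
}
```

As the thread suspects, this blows up with the number of matching terms and
positions; it's only meant to make the bookkeeping concrete.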

Fortunately, for this application, I only care about the hits per page for a
single book at a time. I do NOT have to create a list of all hits on all
pages for all books that have any match.

Thanks
Erick

On 10/9/06, Erick Erickson <[hidden email]> wrote:

>
> Doron:
>
> Thanks for the suggestion, I'll certainly put it on my list, depending
> upon what the PM decides. This app is genealogy research, and users
> *can* put in their own wildcards...
>
> This is why I love this list... lots of smart people giving me suggestions
> I never would have thought of <G>...
>
> Thanks
> Erick

Re: wildcard and span queries

Erik Hatcher
Erick - what about using getSpans() from the SpanQuery that is
generated? That should give you what you're after, I think.

        Erik
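The idea behind the getSpans() suggestion: since every span carries a
document number and a start/end position, "hits on this page" is just a
count over the span enumeration. The sketch below uses a tiny stand-in for
the Spans cursor contract (next()/doc()/start()/end()) rather than Lucene
itself, so the names inside it are hypothetical; with real Lucene you would
obtain the cursor from spanQuery.getSpans(reader) and drive the same loop:

```java
import java.util.*;

// Model of counting query hits per page via a Spans-style cursor.
public class SpanCounting {

    // One matching span: (document, start position, end position).
    static class Match {
        final int doc, start, end;
        Match(int doc, int start, int end) { this.doc = doc; this.start = start; this.end = end; }
    }

    // Minimal cursor mimicking the org.apache.lucene.search.spans.Spans
    // contract: next() advances, doc()/start()/end() describe the match.
    static class Spans {
        private final List<Match> matches; private int i = -1;
        Spans(List<Match> matches) { this.matches = matches; }
        boolean next() { return ++i < matches.size(); }
        int doc() { return matches.get(i).doc; }
        int start() { return matches.get(i).start; }
        int end() { return matches.get(i).end; }
    }

    // "How many times did the query match on this page?" becomes:
    // walk the spans, count those whose doc() is the page's doc id.
    static int hitsOnPage(Spans spans, int pageDoc) {
        int count = 0;
        while (spans.next())
            if (spans.doc() == pageDoc) count++;
        return count;
    }

    public static void main(String[] args) {
        Spans spans = new Spans(Arrays.asList(
                new Match(3, 0, 5), new Match(3, 40, 44), new Match(7, 2, 9)));
        System.out.println(hitsOnPage(spans, 3)); // two spans fall on doc 3
    }
}
```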




Re: wildcard and span queries

Paul Elschot
On Wednesday 11 October 2006 20:30, Erik Hatcher wrote:
> Erick - what about using getSpans() from the SpanQuery that is  
> generated?   That should give you what you're after I think.
>
> Erik

You can also use skipTo(docNr) on the spans to skip to the docNr
of the book that you're after. A Filter for the single book would also
work, but using skipTo() yourself on the spans is easier.

Regards,
Paul Elschot
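The point of skipTo(docNr) is that the span cursor can jump straight to the
first match at or beyond the book's document number, so no Filter is needed.
The class below models that contract over a plain sorted array of doc
numbers; it is a stand-in for illustration, not the Lucene class itself:

```java
// Sketch of Paul's suggestion: use skipTo(docNr) on the spans cursor
// instead of a Filter, then count matches while doc() stays on the book.
public class SkipToSpans {
    private final int[] docs;   // doc numbers of successive matches, ascending
    private int i = -1;

    SkipToSpans(int[] docs) { this.docs = docs; }

    boolean next() { return ++i < docs.length; }
    int doc() { return docs[i]; }

    // Advance to the first match with doc() >= target; false if exhausted.
    // (Lucene's Spans.skipTo likewise skips beyond the current position.)
    boolean skipTo(int target) {
        do {
            if (!next()) return false;
        } while (doc() < target);
        return true;
    }

    // Count matches on exactly docNr: skip there, then walk while we stay.
    static int countOnDoc(SkipToSpans spans, int docNr) {
        int count = 0;
        if (spans.skipTo(docNr)) {
            while (spans.doc() == docNr) {
                count++;
                if (!spans.next()) break;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        SkipToSpans spans = new SkipToSpans(new int[]{1, 4, 4, 4, 9});
        System.out.println(countOnDoc(spans, 4)); // three matches on doc 4
    }
}
```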




Re: wildcard and span queries

Erick Erickson
I thought I'd update folks on the continuing saga. Many thanks to all who've
contributed to my education.

Here's our current resolution:
It turns out that the PM will accept restricting wildcards in two ways.
1> there must be at least 3 non-wildcard characters
2> wildcards cannot appear in the first position.
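Enforcing those two rules before a term ever reaches the query parser could
look something like the helper below. The class and method names are made
up for illustration; this is application-side validation, not Lucene API:

```java
// Hypothetical pre-parse check for the two wildcard restrictions.
public class WildcardPolicy {

    // Rule 1: at least 3 non-wildcard characters.
    // Rule 2: no wildcard ('*' or '?') in the first position.
    static boolean isAllowed(String term) {
        if (term.isEmpty()) return false;
        char first = term.charAt(0);
        if (first == '*' || first == '?') return false;
        int literals = 0;
        for (char c : term.toCharArray())
            if (c != '*' && c != '?') literals++;
        return literals >= 3;
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("jon?es"));  // true: 5 literals, no leading wildcard
        System.out.println(isAllowed("jo*"));     // false: only 2 literals
        System.out.println(isAllowed("*ohn"));    // false: leading wildcard
    }
}
```

Keeping the first position literal is what makes the term enumeration cheap:
the term dictionary can then be seeked to the literal prefix instead of
being scanned end to end.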

It's really a variant on a question that Paul asked, "how useful are queries
that match, say, 250,000 terms?"

With the above restrictions, I'm getting a much smaller number of terms than
250,000, and I can probably use the surround or span family of queries to
get me everything I want without having to write as much code as I was
afraid of.

Thanks again.

Erick
