GetMoreDocs question

GetMoreDocs question

Marcus Falck
Hi,

I have some questions regarding the GetMoreDocs(50) call in the
constructors of the Hits class.

First off, what's the purpose of this call?

Why do I have to make this call if I only want to get a count of the
matching documents and don't want to retrieve any documents from the
index?

Can I do something so I don't cache 100 docs when I'm just asking for
a count?

Regards,
Marcus


Re: GetMoreDocs question

Erick Erickson
See below...

On 8/31/06, Marcus Falck <[hidden email]> wrote:
>
> Hi,
>
>
>
> I have some questions regarding the GetMoreDocs(50) call in the
> constructors of the Hits class.


What constructors? I just get a Hits object returned from the Searcher. Or
are you looking in the source?


> First off, what's the purpose of this call?


I'll pass on this since I don't know.


> Why do I have to make this call if I only want to get a count of the
> matching documents and don't want to retrieve any documents from the
> index?


You don't. Just look at the Hits.length()


> Can I do something so I don't cache 100 docs when I'm just asking for
> a count?


Why do you care? Is this a demonstrable performance issue? Premature
optimization and all that...

I don't know any way of getting around this, but since this hasn't been a
performance issue for me, I haven't looked very hard.

And I assume you're counting the results of a query, in which case getting
around the cached documents sounds like way more work than it's worth.

If you're asking how many documents are in the index, that's another issue.
One technique I've used is to keep a "statistics document" in the index with
fields orthogonal to the fields in my "regular" documents (so no queries
match) with summary statistics that I assemble while making the index. Then,
when I want meta-information, I can just query that specific document and
read it.

Best
Erick



Re: GetMoreDocs question

Chris Hostetter-3
In reply to this post by Marcus Falck

: I have some questions regarding the GetMoreDocs(50) call in the
: constructors of the Hits class.

: First off whats the purposes of this call?

Hits is designed to meet the simple needs of simple clients -- the
assumption is that clients using Hits want simple paginated results - so
Hits goes ahead and gets you page#1.

: Can I do something so I don't cache up 100 docs when I'm just asking for
: a count?

This is what the search method that returns TopDocs is designed for --
when all you care about is the totalHits and the first N docs (in your
case N is 0 or maybe 1)
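
To illustrate the count-only idea in plain Java, here is a minimal sketch that does not use Lucene itself. CountOnlySketch, its Collector interface and the toy search method are hypothetical stand-ins, loosely shaped like Lucene's HitCollector callback (collect(int doc, float score)). The only point is that a counting callback never has to load or cache a document.

```java
import java.util.function.IntPredicate;

final class CountOnlySketch {
    /** Hypothetical stand-in for a per-hit callback (similar in shape to Lucene's HitCollector). */
    interface Collector {
        void collect(int doc, float score);
    }

    /** Toy "searcher": calls the collector once per matching document id and loads nothing. */
    static void search(int maxDoc, IntPredicate matches, Collector collector) {
        for (int doc = 0; doc < maxDoc; doc++) {
            if (matches.test(doc)) {
                collector.collect(doc, 1.0f); // the score is a dummy value in this sketch
            }
        }
    }

    /** Counts the hits without keeping any document ids, scores or stored fields. */
    static int countHits(int maxDoc, IntPredicate matches) {
        final int[] count = {0};
        search(maxDoc, matches, (doc, score) -> count[0]++);
        return count[0];
    }
}
```

With the real library, the equivalent would presumably be a HitCollector passed to Searcher.search(Query, HitCollector), which also avoids the question of what happens when TopDocs is asked for zero documents.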



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Proximity Query Parser

Mark Miller-3
I am not a huge fan of the queryparser's syntax so I have started an
open source project to create a viable alternative. I could really use
some help testing it out. The more I can get it tested, the better
chance it has of serving the community. The parser is called Qsol. I am
right up against its initial release. So far it:

offers a simple clean syntax.
allows arbitrary combinations/nesting of proximity and boolean queries.
allows special date field processing (date searches can use a
constantscore range filter).
other minor features, like makeAllTermsFuzzy() to make your standard
search a fuzzy search (it would probably be god-awful slow, I know, but I
think I have seen this option in MediaWare).

The initial release (if I can get some people to take the plunge
and help me test) will also include sentence/paragraph proximity search
support and a Google Suggest/spell-check type function. I have roughly
implemented both of these, but have not combined them into the parser yet.

I have set up a rough page with some sparse documentation for the parser
at http://famestalker.com/devwiki/ You can download the jar there.

A general query parser is such a pluggable part of Lucene that it would
be really nice to have a few viable options. It seems that everyone that
makes one keeps it proprietary (other than Surround). Help me push this
thing to a 1.0 release! It is almost there. Try it out! Keep in mind,
there are probably plenty of optimizations to be had in the future.

Below is a simple syntax explanation and some sample queries.

- Mark Miller


    Order of Operations

   1. '( )' *parenthesis* : me & (him | her)
   2. '!' *and not* : mill ! bucketloader
   3. '~' *within* : score ~5 lunch : use ord to only find terms in
      order : score ord~5 lunch
   4. '&' *and* : beat & pony
   5. '|' *or* : him | her

Spaces between terms default to & but this can be changed to |

*Escape* - A '\' will escape an operator : m\&m's

*Quotes* - an in-order phrase search with or without a specified slop :
"holy war sick":3 | "gimme all my cake"

*Range Queries* - a query in the form: /beginword - endword/ will
perform a range search. The default search is inclusive. For an
exclusive search use '--' instead of '-' : creditcard[23907094 -
23094345] | creditcard[23907094 -- 23094345]

*Wildcards* - * indicates zero or more unknowns and ? indicates a single
unknown : old harr*t?n | kil?r

A wildcard query cannot begin with an unknown.
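
One way to read the wildcard rules above is as a translation into a regular expression. The sketch below is an assumption about how such a translation could look, not Qsol's actual implementation; WildcardSketch, toPattern and matches are names made up for illustration.

```java
import java.util.regex.Pattern;

final class WildcardSketch {
    /** Translates a wildcard term into a regex: '*' becomes ".*", '?' becomes ".", the rest is literal. */
    static Pattern toPattern(String term) {
        if (term.isEmpty() || term.charAt(0) == '*' || term.charAt(0) == '?') {
            throw new IllegalArgumentException(
                    "a wildcard term cannot begin with an unknown: " + term);
        }
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (c == '*') {
                regex.append(".*");   // zero or more unknowns
            } else if (c == '?') {
                regex.append('.');    // exactly one unknown
            } else {
                regex.append(Pattern.quote(String.valueOf(c))); // literal character
            }
        }
        return Pattern.compile(regex.toString());
    }

    /** True if the whole candidate term matches the wildcard term. */
    static boolean matches(String wildcard, String candidate) {
        return toPattern(wildcard).matcher(candidate).matches();
    }
}
```

For example, harr*t?n would match "harrington", while kil?r would match "kilar" but not "killer", since '?' stands for exactly one character.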

*Fuzzy Query* : a ` indicates the preceding term should be a fuzzy
term : old carrot & devil` may cry

*Paragraph/Sentence Proximity Searching*

If you have enabled sentence and paragraph proximity searching then the
'~' operator may also be used as '~3p' or '~5s' to perform paragraph and
sentence proximity searches.

*Sample Queries:*

        example = "(good witch & \"killa the willaw\") ~4 scary ! man";
        expected = "+(+spanNear([allFields:good, allFields:scary], 4, false)"
                + " -spanNear([allFields:good, allFields:man], 4, false))"
                + " +(+spanNear([allFields:witch, allFields:scary], 4, false)"
                + " -spanNear([allFields:witch, allFields:man], 4, false))"
                + " +(+spanNear([spanNear([allFields:killa, allFields:willaw], 1, true),"
                + " allFields:scary], 4, false)"
                + " -spanNear([spanNear([allFields:killa, allFields:willaw], 1, true),"
                + " allFields:man], 4, false))";
        assertEquals(expected, parse(example));
       
        example = "beat` old magpie`";
        expected = "+allFields:beat~0.5 +allFields:old +allFields:magpie~0.5";
        assertEquals(expected, parse(example));

        example = "me \\| the & test & hole";
        expected = "+allFields:me +allFields:test +allFields:hole";
        assertEquals(expected, parse(example));

        example = "\"test the big search\":30 & me";
        expected = "+spanNear([allFields:test, allFields:big, allFields:search], 30, true)"
                + " +allFields:me";
        assertEquals(expected, parse(example));

        example = "me & fox & cop";
        expected = "+allFields:me +allFields:fox +allFields:cop";
        assertEquals(expected, parse(example));
       
        example = "date[8/5/82]";
        expected = "date:19820805";
        assertEquals(expected, parse(example));

        example = "date[> 12/31/02]";
        expected = "ConstantScore(date:[20021231-})";
        assertEquals(expected, parse(example));

        example = "date[< 03/23/2004]";
        expected = "ConstantScore(date:{-20040323])";
        assertEquals(expected, parse(example));
       
        example = "date[3/23/2004 - 6/34/02]";
        expected = "ConstantScore(date:[20040323-20020704])";
        assertEquals(expected, parse(example));
       
        example = "field1,field2[(search & old) ~3 horse]";
        expected = "(+spanNear([field1:search, field1:horse], 3, false)"
                + " +spanNear([field1:old, field1:horse], 3, false))"
                + " (+spanNear([field2:search, field2:horse], 3, false)"
                + " +spanNear([field2:old, field2:horse], 3, false))";
        assertEquals(expected, parse(example));

        example = "field1[search | old ~3 horse]";
        expected = "(field1:search spanNear([field1:old, field1:horse], 3, false))";
        assertEquals(expected, parse(example));

        parser.makeAllTermsFuzzy(true);
        example = "meat & old cleaver | mike ~3 (dirty man)";
        expected = "(+allFields:meat~0.5 +allFields:old~0.5 +allFields:cleaver~0.5)"
                + " (+spanNear([fuzzy(allFields:mike), fuzzy(allFields:dirty)], 3, false)"
                + " +spanNear([fuzzy(allFields:mike), fuzzy(allFields:man)], 3, false))";
        assertEquals(expected, parse(example));
        parser.makeAllTermsFuzzy(false);
       
        example = "goat-valley";
        expected = "spanNear([allFields:goat, allFields:valley], 1, true)";
        assertEquals(expected, parse(example));
       
        example = "goat -- valley";
        expected = "allFields:[goat TO valley]";
        assertEquals(expected, parse(example));
       
        example = "goat \\-- valley";
        expected = "+allFields:goat +allFields:valley";
        assertEquals(expected, parse(example));
       
        example = "goat \\- valley";
        expected = "+allFields:goat +allFields:valley";
        assertEquals(expected, parse(example));
       
        example = "goat - valley";
        expected = "allFields:{goat TO valley}";
        assertEquals(expected, parse(example));






Re: Proximity Query Parser

Paul Elschot
Mark,

On Thursday 31 August 2006 23:18, Mark Miller wrote:
> I am not a huge fan of the queryparser's syntax so I have started an
> open source project to create a viable alternative. I could really use
> some helping testing it out. The more I can get it tested the better
> chance it has of serving the community. The parser is called Qsol. I am
> right up against its initial release. So far it:
>
> offers a simple clean syntax.
> allows arbitrary combinations/nesting of proximity and boolean queries.

Could you say in a few words how the combination of proximity and boolean
is implemented in Qsol?

I found this the most difficult thing to implement in surround. In surround,
every subquery that can be a proximity subquery has two (groups of) methods:
one for use as boolean and one for use as proximity.
I'd like to have a mechanism that allows mixing proximity and boolean queries
built into Lucene.

Did you also implement parsed phrases with Lucene's PhraseQuery?
Surround does not have that.

Regards,
Paul Elschot



SV: GetMoreDocs question

Marcus Falck
In reply to this post by Marcus Falck
Thanks, Hoss.

But why do I have to retrieve at least 1 document when I'm using TopDocs?
(If I set nDoc to 0 it will throw an exception.)

/
Marcus

-----Original message-----
From: Chris Hostetter [mailto:[hidden email]]
Sent: 31 August 2006 20:09
To: [hidden email]
Subject: Re: GetMoreDocs question


: I have some questions regarding the GetMoreDocs(50) call in the
: constructors of the Hits class.

: First off whats the purposes of this call?

Hits is designed to meet the simple needs of simple clients -- the
assumption is that clients using Hits want simple paginated results - so
Hits goes ahead and gets you page#1.

: Can I do something so I don't cache up 100 docs when I'm just asking for
: a count?

This is what the search method that returns TopDocs is designed for --
when all you care about is the totalHits and the first N docs (in your
case N is 0 or maybe 1)



-Hoss




Re: Proximity Query Parser

Mark Miller-3
In reply to this post by Paul Elschot
Hi Paul,

I'm afraid my programming is probably quite a ways behind yours, so I doubt
anything I have done will be of any help to you.

I also have to treat things differently depending on whether I am in a
proximity clause or boolean clause. A wildcard in a boolean is mapped to
a wildcard query. A wildcard in a proximity is mapped to a regex span
that has been modified to only deal with * and ?. When I run into a
proximity, I collect a small tree of each clause and distribute them
against each other...(old | map) ~3 big gets distributed to old ~3 big |
map ~3 big. This distribution method appears to handle all
boolean/proximity nesting/mixing cases for me, including: great ! "big
old phrase search" ~5 (holy ~4 (big black bear)). The distribution
maintains order of operations, but also obviously can create some pretty
large queries.
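
The distribution step described above can be pictured as a cross product of the alternatives on each side of the proximity operator. This is only a sketch of the '|' case from the example, with hypothetical names (ProximityDistribution, distribute, toQuery); it works on plain strings, not on Qsol's real clause trees.

```java
import java.util.ArrayList;
import java.util.List;

final class ProximityDistribution {
    /** Pairs every left alternative with every right alternative under the same slop. */
    static List<String> distribute(List<String> left, int slop, List<String> right) {
        List<String> clauses = new ArrayList<>();
        for (String l : left) {
            for (String r : right) {
                clauses.add(l + " ~" + slop + " " + r);
            }
        }
        return clauses;
    }

    /** Joins the distributed clauses back into one '|' query string. */
    static String toQuery(List<String> clauses) {
        return String.join(" | ", clauses);
    }
}
```

Distributing the alternatives [old, map] against [big] with slop 3 gives "old ~3 big | map ~3 big". The number of clauses is the product of the two list sizes, which is why nested groups can produce large queries.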

I did not use the phrase search because I do not like how the slop works
(not in order, etc.), so both inside and outside proximity clauses I use a
SpanNear instead. For a multiphrase search I use a SpanOr on words in the
same position.

- Mark



Re: Proximity Query Parser

Paul Elschot
On Friday 01 September 2006 12:54, Mark Miller wrote:

> Hi Paul,
>
> I also have to treat things differently depending on if I am in a
> proximity clause or boolean clause. A wildcard in a boolean is mapped to
> a wildcard query. A wildcard in a proximity is mapped to a regex span
> that has been modified to only deal with * and ?. When I run into a
> proximity, I collect a small tree of each clause and distribute them
> against each other...(old | map) ~3 big gets distributed to old ~3 big |
> map ~3 big. This distribution method appears to handle all

There is no need to repeat "big". SpanQueries can be nested,
so when mapping like this:
SpanNear(SpanOr( old, map), big)
the query structure will only grow for truncations and fuzzy stuff.
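
A small plain-Java model may help show why the nested form matches the same documents as the distributed one. It does not use Lucene's span classes: NestedSpanSketch and its helpers are hypothetical, and "near" is simplified to single-token spans with at most slop tokens in between, ignoring order.

```java
import java.util.ArrayList;
import java.util.List;

final class NestedSpanSketch {
    /** Positions at which the term occurs in the tokenized document. */
    static List<Integer> positions(String[] doc, String term) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < doc.length; i++) {
            if (doc[i].equals(term)) {
                out.add(i);
            }
        }
        return out;
    }

    /** "Or" over single terms: the union of their positions. */
    static List<Integer> or(String[] doc, String... terms) {
        List<Integer> out = new ArrayList<>();
        for (String t : terms) {
            out.addAll(positions(doc, t));
        }
        return out;
    }

    /** Simplified unordered "near": some pair of positions has at most slop tokens in between. */
    static boolean near(List<Integer> a, List<Integer> b, int slop) {
        for (int p : a) {
            for (int q : b) {
                if (p != q && Math.abs(p - q) - 1 <= slop) {
                    return true;
                }
            }
        }
        return false;
    }

    /** Nested form: near(or(old, map), big); "big" appears once in the query. */
    static boolean nested(String[] doc, int slop) {
        return near(or(doc, "old", "map"), positions(doc, "big"), slop);
    }

    /** Distributed form: near(old, big) or near(map, big); "big" is repeated. */
    static boolean distributed(String[] doc, int slop) {
        return near(positions(doc, "old"), positions(doc, "big"), slop)
                || near(positions(doc, "map"), positions(doc, "big"), slop);
    }
}
```

In this model both forms give the same answer for any document, but the nested form mentions "big" only once.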

> boolean/proximity nesting/mixing cases for me, including: great ! "big
> old phrase search" ~5 (holy ~4 (big black bear)). The distribution
> maintains order of operations, but also obviously can create some pretty
> large queries.

Regards,
Paul Elschot



Re: Proximity Query Parser

Mark Miller-3
Thanks for the tip Paul. It is embarrassing, but I only realized how OrSpan
queries worked a day or two ago based on a tip from Eric. The way I assumed
it would create the spans before was just wrong and I never had researched
further. Now I see that it would be a nice optimization for what I
have...but I have not yet looked into how easy it will be to integrate it
into my distribution algorithm. I do use it for multiphrase queries, however,
based on Eric's tip. It will hopefully be pretty simple to apply it to my
distribution, but I have not had time to check it out. I plowed this thing
out pretty quickly and am hoping I can go back and clean up a lot of things.
Need a short break though to pump out some other things. As I learn more
about Lucene and JavaCC I will incorporate new methods into the parser.


- Mark



Re: Proximity Query Parser

Mark Miller-3
Eric also gave me the idea of using a SpanNear with maximum slop as a
boolean to connect spans. Using this and SpanOr seems to make my time spent
on the distribution of proximity clauses a little foolish :) Is that true?
Is there any disadvantage to the max slop SpanNear, SpanOr solution? Any
advantage to distributing the 'and's?

- Mark



Re: SV: GetMoreDocs question

Chris Hostetter-3
In reply to this post by Marcus Falck

: But why do I have to reterive atleast 1 document when im using the TopDocs ?
: (If I set nDoc to 0 it will throw an exception).

i didn't say you had to, i just said "maybe" ... i don't know what the
behavior is if you use 0 -- ideally it would work fine, but in practice i
don't know if anyone has ever tested that case.




-Hoss




Re: Proximity Query Parser

Paul Elschot
In reply to this post by Mark Miller-3
On Friday 01 September 2006 19:46, Mark Miller wrote:
> Eric also gave me the idea of using a SpanNear with maximum slop as a
> boolean to connect spans. Using this and SpanOr seems to make my time spent
> on the distribution of proximity clauses a little foolish :) Is that true?

There is practice and there is theory. You chose practice this time.
(In theory there is no difference between the two, but in practice...)

> Is there any disadvantage to the max slop Spannear, SpanOr solution? Any
> advantage to distributing the 'and's?

Span queries (and phrase queries) access the proximity information,
and that slows them down compared to pure boolean queries,
which can get away with using only the term frequencies in the
documents. The difference in access time is roughly as big as these
term frequencies.
When querying an index with larger documents, the difference can be
quite noticeable. However, using proximity information normally
gives more accurate results. With operators in the query language,
the choice is up to the user.

Similarly, phrase queries are faster than span queries, but phrase queries
cannot be nested. Ideally, a query language would hide this, but
this requires an implementation in which phrase queries treat slop
in the same way as span queries.
 
Regards,
Paul Elschot
