SpanNearQuery's spans & payloads

SpanNearQuery's spans & payloads

Michael McCandless-2
Under LUCENE-1458, I'm hitting a curious test failure in
TestPositionsIncrement.testPayloadsPos0.  The failure happens because
the codec I'm testing (pulsing codec) allows you to retrieve the same
payload more than once if the term was pulsed (inlined into terms
dict), whereas w/ trunk you can only retrieve the payload once.

But in debugging the failure, I'm struggling with what the correct
behavior of SpanNearQuery really should be.

The test creates a single doc with one analyzed field, with these
single letter position:tokens:

   0:a 1:a 1:b 2:c 2:d 3:e 3:a 4:f 4:g 5:h 5:i 6:j 6:a 7:b 7:k 8:k

every token has a payload.

Then it makes:

  SpanNearQuery
    SpanTermQuery term=a
    SpanTermQuery term=k

Term "a" occurs four times (positions 0, 1, 3, 6) and "k" occurs 2
times (positions 7, 8).
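
To make the position bookkeeping concrete, here is a minimal sketch in plain Python (not Lucene code) that parses the position:token listing above into per-term position lists. The names `TOKENS` and `positions_by_term` are made up for this illustration.

```python
from collections import defaultdict

# Token stream from the test, written as position:term pairs.
TOKENS = "0:a 1:a 1:b 2:c 2:d 3:e 3:a 4:f 4:g 5:h 5:i 6:j 6:a 7:b 7:k 8:k"

def positions_by_term(tokens):
    """Map each term to the sorted list of positions where it occurs."""
    positions = defaultdict(list)
    for item in tokens.split():
        pos, term = item.split(":")
        positions[term].append(int(pos))
    return {term: sorted(found) for term, found in positions.items()}

postings = positions_by_term(TOKENS)
print(postings["a"])  # [0, 1, 3, 6]
print(postings["k"])  # [7, 8]
```

Several tokens share a position (for example "a" and "b" at position 1), which is what a position increment of 0 produces.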

My first question is: what spans is SpanNearQuery supposed to
enumerate?  Right now trunk does these four:

   span 0 to 8
   span 1 to 8
   span 3 to 8
   span 6 to 8

which represents position 7 of "k" mated with all positions of "a"
(remember a span's end is 1 + its last position, so "k"'s position 7
turned into 8).  How come the position 8 occurrence of "k" was not
included in any spans?

My second question is: when you call getPayload() on each span, what
should you get?  Right now trunk does this:

    span 0 to 8
      payload: pos: 0
      payload: pos: 7
    span 1 to 8
      payload: pos: 0
    span 3 to 8
      payload: pos: 3
    span 6 to 8
      payload: pos: 6

The first span properly includes the payload for "a" (pos: 0) and for
"k" (pos: 7), but the subsequent three do not include the payload
for "k".  Shouldn't you get all payloads associated w/ the span?

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


NumericRange Field and LuceneUtils?

Daniel Shane-2
Is it normal that LuceneUtils.getString(Document document, String
fieldName) uses document.getField() in the background?

If, for example, you indexed something using the new NumericRange field,
then you will get a class cast exception in there.

Would it not be better to call getFieldable() instead of getField()?

Daniel Shane


Re: NumericRange Field and LuceneUtils?

Yonik Seeley-2-2
Sounds right... but this LuceneUtils class isn't part of Lucene, is it?

-Yonik
http://www.lucidimagination.com



On Fri, Sep 11, 2009 at 3:01 PM, Daniel Shane <[hidden email]> wrote:
> Is it normal that LuceneUtils.getString(Document document, String fieldName)
> uses document.getField() in the background?
>
> If, for example, you indexed something using the new NumericRange field,
> then you will get a class cast exception in there.
>
> Would it not be better to call getFieldable() instead of getField()?
>
> Daniel Shane


Re: SpanNearQuery's spans & payloads

Mark Miller-3
In reply to this post by Michael McCandless-2
I'd have to dig in to be of much help. Hard to remember this stuff.

0:a 1:a 1:b 2:c 2:d 3:e 3:a 4:f 4:g 5:h 5:i 6:j 6:a 7:b 7:k 8:k

 span 0 to 8
 span 1 to 8
 span 3 to 8
 span 6 to 8

I think those are the right 4. You start on the left and work right.
Spans always start after the last one started.

So first you would find: 0 to 8. After 0, 1 to 8.
After 1, 3 to 8, and after 3, 6 to 8. That makes sense.
You never see 9 because the 8 comes first and you can
end as many times on a pos as you want - but you don't
ever start a span at the same pos. So I think this is right.
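
The rule described here (one span per start, each ending at the closest match) can be sketched for the two-clause case as follows. This is a toy model in plain Python, not Lucene's implementation; it ignores slop (assumed large enough) and only handles the first term occurring before the second. The function name is hypothetical.

```python
from bisect import bisect_right

def near_spans_two_clauses(first_positions, second_positions):
    """Toy model: one span per occurrence of the first term, each ending at
    the closest later occurrence of the second term (end is exclusive).
    Both position lists must be sorted."""
    spans = []
    for start in first_positions:
        i = bisect_right(second_positions, start)  # first position > start
        if i == len(second_positions):
            break  # no later occurrence of the second term
        spans.append((start, second_positions[i] + 1))
    return spans

# "a" at 0, 1, 3, 6 and "k" at 7, 8, as in the test document.
print(near_spans_two_clauses([0, 1, 3, 6], [7, 8]))
# [(0, 8), (1, 8), (3, 8), (6, 8)]
```

Every span ends at 8 because position 7 is always the closest "k"; the occurrence at position 8 is never needed to complete a span.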

The second question I am less sure about without looking at code.
I think it's because each payload can only be loaded once. So the first
time you hit 0 to 8, you get both payloads - but for every other span that
hits 8, that payload was already loaded? So you get all of the payloads
you should, you're just not getting duplicates in each span. I'd have to
think harder about it - but overall it appears right ... ? All the Spans
are subspans of a larger Span, right?

- Mark





--
- Mark

http://www.lucidimagination.com





Re: SpanNearQuery's spans & payloads

Michael McCandless-2
Thanks Mark! -- comments below:

On Fri, Sep 11, 2009 at 3:34 PM, Mark Miller <[hidden email]> wrote:

> I'd have to dig in to be of much help. Hard to remember this stuff.
>
> 0:a 1:a 1:b 2:c 2:d 3:e 3:a 4:f 4:g 5:h 5:i 6:j 6:a 7:b 7:k 8:k
>
>  span 0 to 8
>  span 1 to 8
>  span 3 to 8
>  span 6 to 8
>
> I think those are the right 4. You start on the left and work
> right. Spans always start after the last one started.

OK, so SpanNearQuery always takes its left-most clause, releases a
span, and then advances it?  What if there is a tie for two left-most
clauses?

Eg if I had included "b" as a clause, here, it'd tie with "a" at
position 1 -- hmm, I just tested this: you get "span 1 to 8" twice:

    span 0 to 8
       payload: pos: 7
       payload: pos: 1
       payload: pos: 0
    span 1 to 8
       payload: pos: 0
    span 1 to 8
       payload: pos: 3
    span 3 to 8
       payload: pos: 6
    span 6 to 8
       payload: pos: 6

Also, the payloads sort of shifted down (eg "pos: 3" now shows up in
the "span 1 to 8" but before showed up in "span 3 to 8"), and "pos: 1"
(for b) was added under "span 0 to 8".

(NOTE: confusingly, the "payload: pos: N" is off by one, in this test,
ie the "real" position is N+1).
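
The "emit a span, then advance the left-most clause" behavior, including the tie at position 1, can be reproduced with a small toy model. It is plain Python rather than Lucene code, and it rests on assumptions: every clause is a single term of length 1, all clauses have at least one occurrence, and ties are broken by clause order (the real tie-breaking may differ). The function name is hypothetical.

```python
def near_spans_unordered(clauses, slop):
    """Toy model of "emit a span, then advance the left-most clause".

    clauses maps a term to its sorted, non-empty list of positions.
    Returns (start, end) pairs with an exclusive end.
    """
    cursors = {term: 0 for term in clauses}  # index of each clause's current occurrence
    spans = []
    while True:
        current = {term: clauses[term][i] for term, i in cursors.items()}
        start = min(current.values())
        end = max(current.values()) + 1
        # Width of the span minus the positions taken by the clauses themselves.
        if end - start - len(clauses) <= slop:
            spans.append((start, end))
        leftmost = min(current, key=current.get)  # on a tie, the first clause listed wins
        cursors[leftmost] += 1
        if cursors[leftmost] == len(clauses[leftmost]):
            return spans  # a clause ran out, so no further spans are possible

print(near_spans_unordered({"a": [0, 1, 3, 6], "k": [7, 8]}, slop=30))
# [(0, 8), (1, 8), (3, 8), (6, 8)]
print(near_spans_unordered({"a": [0, 1, 3, 6], "b": [1, 7], "k": [7, 8]}, slop=30))
# [(0, 8), (1, 8), (1, 8), (3, 8), (6, 8)]
```

In this model the duplicate "1 to 8" appears because "a" and "b" are each left-most once at position 1, and the list of spans is the same whichever of the two is advanced first.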

> So first you would find: 0 to 8. After 0, 1 to 8.
> After 1, 3 to 8, and after 3, 6 to 8. That makes sense.
> You never see 9 because the 8 comes first and you can
> end as many times on a pos as you want - but you don't
> ever start a span at the same pos. So I think this is right.

I think (if I were using SpanNearQuery) I'd want it to somehow include
9, but I'm not quite sure how.  This test sets slop to 30, so maybe
I'd want to see 0-9, 1-9, 3-9, 6-9?  Ie the "maximal" spans possible.
EG my app will never see "k"'s payload from its occurrence at position
8.

> The second question I am less sure about without looking at code.
> I think it's because each payload can only be loaded once. So the first
> time you hit 0 to 8, you get both payloads - but for every other span that
> hits 8, that payload was already loaded? So you get all of the payloads
> you should, you're just not getting duplicates in each span. I'd have to
> think harder about it - but overall it appears right ... ?

Yeah that is the reason why you only see each payload once, but I'm
not sure that's "right".  I guess an app can always store away each
payload and pull it later, but e.g. if the app wants to score each span
using the payloads from all occurrences of clauses within it, you
can't trust getPayload() for that.
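
A rough sketch of the "store each payload away and pull it later" workaround, again as a plain-Python toy model rather than Lucene code. The class `OnceOnlyPayloads`, the payload strings, and the use of true positions as labels (instead of the test's off-by-one "pos: N" strings) are all assumptions made for the illustration.

```python
class OnceOnlyPayloads:
    """Hands out the payload of each (term, position) occurrence at most once."""

    def __init__(self, payloads):
        self._payloads = dict(payloads)
        self._loaded = set()

    def load(self, occurrence):
        if occurrence in self._loaded:
            return None  # already consumed by an earlier span
        self._loaded.add(occurrence)
        return self._payloads[occurrence]

# Each span with the clause occurrences it is built from: (term, position).
spans = [
    ((0, 8), [("a", 0), ("k", 7)]),
    ((1, 8), [("a", 1), ("k", 7)]),
    ((3, 8), [("a", 3), ("k", 7)]),
    ((6, 8), [("a", 6), ("k", 7)]),
]
payloads = {occ: f"payload of {occ[0]}@{occ[1]}" for _, occs in spans for occ in occs}

reader = OnceOnlyPayloads(payloads)
cache = {}     # application-side store, filled when a payload is first seen
raw = {}       # what each span reports on its own
complete = {}  # what the application can rebuild from the cache
for span, occurrences in spans:
    raw[span] = []
    for occ in occurrences:
        payload = reader.load(occ)
        if payload is not None:
            raw[span].append(payload)
            cache[occ] = payload
    complete[span] = [cache[occ] for occ in occurrences]

print(raw[(1, 8)])       # ['payload of a@1']
print(complete[(1, 8)])  # ['payload of a@1', 'payload of k@7']
```

The cache only helps for occurrences that some span actually visits; the "k" at position 8 is never part of a span, so its payload is still never seen.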

> All the Spans are subspans of a larger Span right?

Not sure what you mean here?

Mike


RE: NumericRange Field and LuceneUtils?

Uwe Schindler
In reply to this post by Daniel Shane-2
Hallo Daniel,

I am not really sure what you are talking about (what is LuceneUtils?).

To your question about NumericField: NumericField is only used for indexing.
If you also store the field to retrieve it from the index e.g. with search
results, NumericField creates a stored Field containing the number as a
conventional string (the special trie encoding is only used for *indexing*
not *storing*). If you call getField() it returns a standard Field containing
the number as String.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [hidden email]


RE: NumericRange Field and LuceneUtils?

Uwe Schindler
By the way: This is documented:
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/core/org/apache/lucene/document/NumericField.html

NOTE: This class is only used during indexing. When retrieving the stored
field value from a Document instance after search, you will get a
conventional Fieldable instance where the numeric values are returned as
Strings (according to toString(value) of the used data type).

(this o.a.l.document.Fieldable is always a o.a.l.document.Field)

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [hidden email]



Re: NumericRange Field and LuceneUtils?

Yonik Seeley
On Fri, Sep 11, 2009 at 4:45 PM, Uwe Schindler <[hidden email]> wrote:

> By the way: This is documented:
> http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/core/org/apache/lucene/document/NumericField.html
>
> NOTE: This class is only used during indexing. When retrieving the stored
> field value from a Document instance after search, you will get a
> conventional Fieldable instance where the numeric values are returned as
> Strings (according to toString(value) of the used data type).
>
> (this o.a.l.document.Fieldable is always a o.a.l.document.Field)

Lazy loading could return a different implementation.  Even w/o lazy
loading, we're also not going to guarantee that a Fieldable is always
a Field, right?  Perhaps those methods returning a Field should be
deprecated sometime.

-Yonik
http://www.lucidimagination.com


Re: SpanNearQuery's spans & payloads

Grant Ingersoll-2
In reply to this post by Michael McCandless-2
Just FYI I recall a fair amount of discussion on SpanNear:

http://www.lucidimagination.com/search/s:email/l:dev?q=SpanNearQuery
http://www.lucidimagination.com/search/?q=NearSpansOrdered#/s:email/l:dev
See also http://issues.apache.org/jira/browse/LUCENE-1001

I remember being very confused by NearSpansOrdered and NearSpansUnordered
and also thinking there are some oddities (scoring notwithstanding).

On Sep 11, 2009, at 2:32 PM, Michael McCandless wrote:

> Under LUCENE-1458, I'm hitting a curious test failure in
> TestPositionsIncrement.testPayloadsPos0.  The failure happens because
> the codec I'm testing (pulsing codec) allows you to retrieve the same
> payload more than once if the term was pulsed (inlined into terms
> dict), whereas w/ trunk you can only retrieve the payload once.

Re: SpanNearQuery's spans & payloads

Mark Miller-3
In reply to this post by Michael McCandless-2
Michael McCandless wrote:

> Thanks Mark! -- comments below:
>
> On Fri, Sep 11, 2009 at 3:34 PM, Mark Miller <[hidden email]> wrote:
>
>  
>> I'd have to dig in to be of much help. Hard to remember this stuff.
>>
>> 0:a 1:a 1:b 2:c 2:d 3:e 3:a 4:f 4:g 5:h 5:i 6:j 6:a 7:b 7:k 8:k
>>
>>  span 0 to 8
>>  span 1 to 8
>>  span 3 to 8
>>  span 6 to 8
>>
>> I think those are the right 4. You start on the left and work
>> right. Spans always start after the last one started.
>>    
>
> OK, so SpanNearQuery always takes its left-most clause, releases a
> span, and then advances it?  What if there is a tie for two left-most
> clauses?
>
> Eg if I had included "b" as a clause, here, it'd tie with "a" at
> position 1 -- hmm, I just tested this: you get "span 1 to 8" twice:
>
>     span 0 to 8
>        payload: pos: 7
>        payload: pos: 1
>        payload: pos: 0
>     span 1 to 8
>        payload: pos: 0
>     span 1 to 8
>        payload: pos: 3
>     span 3 to 8
>        payload: pos: 6
>     span 6 to 8
>        payload: pos: 6
>
> Also, the payloads sort of shifted down (eg "pos: 3" now shows up in
> the "span 1 to 8" but before showed up in "span 3 to 8"), and "pos: 1"
> (for b) was added under "span 0 to 8".
>
> (NOTE: confusingly, the "payload: pos: N" is off by one, in this test,
> ie the "real" position is N+1).
>
>  
>> So first you would find: 0 to 8. After 0, 1 to 8.
>> After 1, 3 to 8, and after 3, 6 to 8. That makes sense.
>> You never see 9 because the 8 comes first and you can
>> end as many times on a pos as you want - but you don't
>> ever start a span at the same pos. So I think this is right.
>>    
>
> I think (if I were using SpanNearQuery) I'd want it to somehow include
> 9, but I'm not quite sure how.  This test sets slop to 30, so maybe
> I'd want to see 0-9, 1-9, 3-9, 6-9?  Ie the "maximal" spans possible.
> EG my app will never see "k"'s payload from its occurrence at position
> 8.
>  
You might want it, but that's not how Spans currently work - they are
not exhaustive. They start at the left and march right - each Span
always starting after the last started, but ending at the closest match.
It's just how the query works, and so when payloads were grafted on ...
they are made to match documents quickly - not enumerate all matches in
a document (I guess).

You might want exhaustive for highlighting as well - but those are
different algorithms ...

>  
>> The second question I am less sure about without looking at code.
>> I think it's because each payload can only be loaded once. So the first
>> time you hit 0 to 8, you get both payloads - but for every other span that
>> hits 8, that payload was already loaded? So you get all of the payloads
>> you should, you're just not getting duplicates in each span. I'd have to
>> think harder about it - but overall it appears right ... ?
>>    
>
> Yeah that is the reason why you only see each payload once, but I'm
> not sure that's "right".  I guess an app can always store away each
> payload and pull it later, but e.g. if the app wants to score each span
> using the payloads from all occurrences of clauses within it, you
> can't trust getPayload() for that.
>  
Fair enough - my idea of what appears right is tainted - I finished getting
NearSpansOrdered to work with payloads and I've fixed some bugs -
but I've never considered how it *should* work - I've just cursed and
moved on trying to get what we have to work.

In the end, I accepted my definition of works as - when I ask for the
payloads back, will I end up with a bag of all the payloads that the
Spans touched. I think you do. If each sub Span duplicated payloads, that
might be right for some apps and it might be a pain for others, right?
You can't count on the order of the payloads or anything I think (been a
while) - so it's just like getting a bag back of those that matched.

Anyway - I'm not happy with a few things, but it was fairly hard just
getting things to work at this level. I'd love for NearSpansOrdered to
actually lazy load the payloads for one.
>  
>> All the Spans are subspans of a larger Span right?
>>    
Sorry ;) I'm practicing with my chaotic brain so that one day I may
actually be halfway clear.

I meant, all those Spans came from one query - so you got your bag of
payloads, right? If each Span was a separate entity, it would obviously
be way wrong - but from a single SpanQuery, at least you got all the
payloads in some form :)

I'd love to be able to give some more intelligent responses here, but
I'd have to dig back into the code again first. Spans were hard enough
to deal with without adding these payloads to the mix :)

>
> Not sure what you mean here?
>
> Mike
>


--
- Mark

http://www.lucidimagination.com





Re: NumericRange Field and LuceneUtils?

hossman
In reply to this post by Daniel Shane-2

: Subject: NumericRange Field and LuceneUtils?
: References: <[hidden email]>
: In-Reply-To: <[hidden email]>

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to
an existing message, instead start a fresh email.  Even if you change the
subject line of your email, other mail headers still track which thread
you replied to and your question is "hidden" in that thread and gets less
attention.   It makes following discussions in the mailing list archives
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/Thread_hijacking



-Hoss



RE: NumericRange Field and LuceneUtils?

Uwe Schindler
In reply to this post by Yonik Seeley
> On Fri, Sep 11, 2009 at 4:45 PM, Uwe Schindler <[hidden email]> wrote:
> > By the way: This is documented:
> > http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/core/org/apache/lucene/document/NumericField.html
> >
> > NOTE: This class is only used during indexing. When retrieving the stored
> > field value from a Document instance after search, you will get a
> > conventional Fieldable instance where the numeric values are returned as
> > Strings (according to toString(value) of the used data type).
> >
> > (this o.a.l.document.Fieldable is always a o.a.l.document.Field)
>
> Lazy loading could return a different implementation.  Even w/o lazy
> loading, we're also not going to guarantee that a Fieldable is always
> a Field, right?  Perhaps those methods returning a Field should be
> deprecated sometime.

Yes. But this is not related to NumericField at all. It would, however,
give us the possibility of also returning NumericField instances from
stored fields at some point in the future.

Uwe



Re: SpanNearQuery's spans & payloads

Michael McCandless-2
In reply to this post by Mark Miller-3
OK thanks for the responses.  This is indeed tricky stuff!

On Sat, Sep 12, 2009 at 12:28 AM, Mark Miller <[hidden email]> wrote:

> They start at the left and march right - each Span always starting
> after the last started,

That's not quite always true -- eg I got span 1-8, twice, once I added
"b" as a clause to the SNQ.

> You might want exhaustive for highlighting as well - but its
> different algorithms ...

Yeah, how we would represent spans for highlighting is tricky... we
had discussed this ("how to represent spans for aggregate queries")
recently, I think under LUCENE-1522.

I think we'd have to return a tree structure, that mirrors the query's
tree structure, to hold the spans, rather than try to enumerate
("denormalize") all possible expansions.  Each leaf node would hold
actual data (position, term, payload, etc.), and then the tree nodes
would express how they are AND'd/OR'd/NEAR'd together.  My app could then
walk the tree to compute any combination I wanted.
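
One possible shape for such a tree, sketched in plain Python. Nothing here is an existing Lucene API: `Leaf`, `Node` and `expansions` are hypothetical names, and slop and ordering constraints are left out. Leaves hold the raw occurrences, inner nodes record the operator, and the application expands combinations only if it wants them.

```python
from dataclasses import dataclass
from itertools import product

@dataclass
class Leaf:
    term: str
    occurrences: list  # (position, payload) pairs for this term

@dataclass
class Node:
    op: str         # "near", "and" or "or"
    children: list  # Leaf or Node instances

def expansions(node):
    """Enumerate the occurrence combinations that a subtree stands for."""
    if isinstance(node, Leaf):
        return [[(node.term, pos, payload)] for pos, payload in node.occurrences]
    child_expansions = [expansions(child) for child in node.children]
    if node.op == "or":
        # Any single child's combination satisfies an OR node.
        return [combo for options in child_expansions for combo in options]
    # AND/NEAR nodes need one combination from every child.
    return [sum(combo, []) for combo in product(*child_expansions)]

tree = Node("near", [
    Leaf("a", [(0, "a@0"), (1, "a@1"), (3, "a@3"), (6, "a@6")]),
    Leaf("k", [(7, "k@7"), (8, "k@8")]),
])
combos = expansions(tree)
print(len(combos))  # 8
print(combos[1])    # [('a', 0, 'a@0'), ('k', 8, 'k@8')]
```

The tree itself stays small (six leaf occurrences here), while the denormalized view grows multiplicatively with the number of clauses.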

> In the end, I accepted my definition of works as - when I ask for
> the payloads back, will I end up with a bag of all the payloads that
> the Spans touched. I think you do.

Yeah I think you do, except each payload is only returned once.  So
it's only the first span that hits a payload that will return it.

So it sounds like SNQ just isn't guaranteed to be exhaustive in how it
enumerates the spans, eg I'll never see that 2nd occurrence of "k",
nor its associated payload.

For now I'll just match this behavior ("can only load payload once")
in all codecs in LUCENE-1458... the test passes again once I do that.

> I meant, all those Spans came from one query - so you got your bag
> of payloads right? If each Span was a separate entity, it would
> obviously be way wrong - but from a single SpanQuery, at least you
> got all the payloads in some form :)

Right, this is all one query... but the payload for the 2nd
occurrence of "k" was never included in any span so I didn't get "all"
payloads.

Maybe if/once we incorporate spans into Lucene's normal queries
(optionally, so there's no performance hit if you don't ask for them)
we can re-visit these issues.

Mike


Re: SpanNearQuery's spans & payloads

Grant Ingersoll-2

On Sep 12, 2009, at 5:12 AM, Michael McCandless wrote:

> OK thanks for the responses.  This is indeed tricky stuff!
>
> On Sat, Sep 12, 2009 at 12:28 AM, Mark Miller  
> <[hidden email]> wrote:
>
>> They start at the left and march right - each Span always starting
>> after the last started,
>
> That's not quite always true -- eg I got span 1-8, twice, once I added
> "b" as a clause to the SNQ.
>
>> You might want exhaustive for highlighting as well - but its
>> different algorithms ...
>
> Yeah, how we would represent spans for highlighting is tricky... we
> had discussed this ("how to represent spans for aggregate queries")
> recently, I think under LUCENE-1522.
>
> I think we'd have to return a tree structure, that mirrors the query's
> tree structure, to hold the spans, rather than try to enumerate
> ("denormalize") all possible expansions.  Each leaf node would hold
> actual data (position, term, payload, etc.), and then the tree nodes
> would express how they are and/ord/near'd together.  My app could then
> walk the tree to compute any combination I wanted.
>
>> In the end, I accepted my definition of works as - when I ask for
>> the payloads back, will I end up with a bag of all the payloads that
>> the Spans touched. I think you do.
>
> Yeah I think you do, except each payload is only returned once.  So
> it's only the first span that hits a payload that will return it.
>
> So it sounds like SNQ just isn't guaranteed to be exhaustive in how it
> enumerates the spans, eg I'll never see that 2nd occurrence of "k",
> nor its associated payload.
>

I believe this is my understanding as well.  If Doug and Paul chime  
in, maybe we will know better.

That being said, I think it is reasonable to want to have an  
exhaustive list of matches, even when they overlap.  We simply could  
create a new SpanNear that does this.
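
As a rough illustration of what an exhaustive variant would report, here is a brute-force sketch in plain Python. It is not a proposal for how such a query should be implemented in Lucene; the function name is hypothetical, every clause is assumed to be a single term of length 1, and the slop test (span width minus the number of clauses) is an assumption of this model.

```python
from itertools import product

def exhaustive_near_spans(clauses, slop):
    """All combinations of one occurrence per clause that fit within slop.

    clauses maps a term to its list of positions; ends are exclusive.
    """
    spans = []
    for combo in product(*clauses.values()):
        start, end = min(combo), max(combo) + 1
        if end - start - len(combo) <= slop:
            spans.append((start, end))
    return sorted(spans)

print(exhaustive_near_spans({"a": [0, 1, 3, 6], "k": [7, 8]}, slop=30))
# [(0, 8), (0, 9), (1, 8), (1, 9), (3, 8), (3, 9), (6, 8), (6, 9)]
```

The cost is the product of the clause frequencies per document, which is presumably why a match-oriented implementation avoids it.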


> For now I'll just match this behavior ("can only load payload once")
> in all codecs in LUCENE-1458... the test passes again once I do that.
>
>> I meant, all those Spans came from one query - so you got your bag
>> of payloads right? If each Span was a separate entity, it would
>> obviously be way wrong - but from a single SpanQuery, at least you
>> got all the payloads in some form :)
>
> Right, this is all one query... but the payload for the 2nd
> occurrence of "k" was never included in any span so I didn't get "all"
> payloads.
>
> Maybe if/once we incorporate spans into Lucene's normal queries
> (optionally, so there's no performance hit if you don't ask for them)
> we can re-visit these issues.

Good luck with that!  :-)  The SpanQuerys themselves ask for them as it
is now.  The bigger bugaboo to fix, I think, is the use case I laid
out a bit ago, where it is a real pain to coalesce running the query
with effectively accessing the Spans without having to constantly
reset/skipTo.


Re: SpanNearQuery's spans & payloads

Mark Miller-3
In reply to this post by Michael McCandless-2
Michael McCandless wrote:

> OK thanks for the responses.  This is indeed tricky stuff!
>
> On Sat, Sep 12, 2009 at 12:28 AM, Mark Miller <[hidden email]> wrote:
>
>  
>> They start at the left and march right - each Span always starting
>> after the last started,
>>    
>
> That's not quite always true -- eg I got span 1-8, twice, once I added
> "b" as a clause to the SNQ.
>  
Mmm - right - depends on how you look at it I think - it is less simple
with terms at multiple positions, in that now each Span doesn't start
in the *position* after the last - but if you line up the terms like you
did, it's still the same - the first 1 to 8 starts at the first term at
pos 1, and the next 1 to 8 starts at the second term at pos 1. One starts
after the other (though if you think in Lucene positions, I realize they
virtually start at the same spot).

>  
>> You might want exhaustive for highlighting as well - but its
>> different algorithms ...
>>    
>
> Yeah, how we would represent spans for highlighting is tricky... we
> had discussed this ("how to represent spans for aggregate queries")
> recently, I think under LUCENE-1522.
>
> I think we'd have to return a tree structure, that mirrors the query's
> tree structure, to hold the spans, rather than try to enumerate
> ("denormalize") all possible expansions.  Each leaf node would hold
> actual data (position, term, payload, etc.), and then the tree nodes
> would express how they are and/ord/near'd together.  My app could then
> walk the tree to compute any combination I wanted.
>
>  
>> In the end, I accepted my definition of works as - when I ask for
>> the payloads back, will I end up with a bag of all the payloads that
>> the Spans touched. I think you do.
>>    
>
> Yeah I think you do, except each payload is only returned once.  So
> it's only the first span that hits a payload that will return it.
>
> So it sounds like SNQ just isn't guaranteed to be exhaustive in how it
> enumerates the spans, eg I'll never see that 2nd occurrence of "k",
> nor its associated payload.
>  
Not only not guaranteed, but it's just not going to happen - it's not
how spans match. If I say find n within 300 of m with the following:

n m m m m m m m m m m m m  m m m m m m m m m m m m m m m m m m m m m m
m  m m m m m m m m m m m

Only the first m will match. It will start at the left, find the n, then
say great, an m within 300, this doc matches, we are done. There is
not another n to start on or finish on to the right. It doesn't then
touch the next 300 m's - that's just the way Doug implemented them, from
what I can tell. It's only exhaustive from the left. Now find m within
300 of n, where order matters (m first):

m m m m m m m m m m m m m m m m m m n

This will be a bunch of spans - start at the left - the first m to n
matches, then the second m - n matches, then the third m to n matches,
and so on as we move right.

> For now I'll just match this behavior ("can only load payload once")
> in all codecs in LUCENE-1458... the test passes again once I do that.
>
>  
>> I meant, all those Spans came from one query - so you got your bag
>> of payloads right? If each Span was a separate entity, it would
>> obviously be way wrong - but from a single SpanQuery, at least you
>> got all the payloads in some form :)
>>    
>
> Right, this is all one query... but the payload for the 2nd
> occurrence of "k" was never included in any span so I didn't get "all"
> payloads.
>  
You got all the payloads the query matched - I think you need a
different query (or we change the Spans algorithm completely).

> Maybe if/once we incorporate spans into Lucene's normal queries
> (optionally, so there's no performance hit if you don't ask for them)
> we can re-visit these issues.
>
> Mike
>


--
- Mark

http://www.lucidimagination.com





Re: SpanNearQuery's spans & payloads

Mark Miller-3
In other words, Spans is guaranteed to find a document *if* a set of
terms match the positional constraints - if bush is within 20 of george,
it's guaranteed to find that - but it doesn't give any concern to finding
every george within 20 of bush (though it may find multiple, or even
all of them, depending on how the text is set up and the query constraints).
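
To make that distinction concrete, here is a toy sketch in plain Java. It is
not Lucene code: the class name, method names and term positions are all made
up for illustration. It contrasts a simple "does this document match?" check
with an exhaustive count of every george/bush pair within the slop. Going by
the description above, Spans only promises the first of these.

```java
class ProximityToy {
    // True if at least one occurrence of term a lies within `slop`
    // positions of at least one occurrence of term b.
    static boolean matches(int[] a, int[] b, int slop) {
        for (int x : a) {
            for (int y : b) {
                if (Math.abs(x - y) <= slop) {
                    return true; // one qualifying pair is enough
                }
            }
        }
        return false;
    }

    // Number of (a, b) position pairs within `slop`: the exhaustive view
    // that, per the discussion, Spans does not try to enumerate.
    static int countAllPairs(int[] a, int[] b, int slop) {
        int count = 0;
        for (int x : a) {
            for (int y : b) {
                if (Math.abs(x - y) <= slop) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int[] george = {0, 5, 40}; // hypothetical positions
        int[] bush = {10, 12};     // hypothetical positions
        System.out.println("matches: " + matches(george, bush, 20));
        System.out.println("all pairs: " + countAllPairs(george, bush, 20));
    }
}
```

With these made-up positions the document matches, and there are four
qualifying pairs in total; nothing in the guarantee says all four would come
back as spans.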



--
- Mark

http://www.lucidimagination.com





Re: SpanNearQuery's spans & payloads

Mark Miller-3
In reply to this post by Mark Miller-3
Mark Miller wrote:

>
>> Yeah I think you do, except each payload is only returned once.  So
>> it's only the first span that hits a payload that will return it.
>>
>> So it sounds like SNQ just isn't guaranteed to be exhaustive in how it
>> enumerates the spans, eg I'll never see that 2nd occurrence of "k",
>> nor its associated payload.
>>  
>>    
> Not only not guaranteed, but its just not going to happen - its not
> how spans match. If I say find n within 300 of m with the following:
>
> n m m m m m m m m m m m m  m m m m m m m m m m m m m m m m m m m m m m
> m  m m m m m m m m m m m
>
> Only the first m will match. It will start at the left, find the n, then
> say great, an m within 300, this doc matches, we are done. There is
> not another n to start on or finish on to the right. It doesn't then
> touch the next 300 m's - just they way Doug implemented them from what I
> can tell. Its only exhaustive from the
> left - find m within 300 of n, order matters (m first)
>
> m m m m m m m m m m m m m m m m m m n
>
> This will be a bunch of spans - start at the left - the first m to n
> matches, then the second m - n matches, then the third m to n matches,
> and so on as we move right.
>  
You can figure out what will match using the Span rules I mentioned, by
the way (at least I believe so).

Those rules are simple (this is my current working knowledge and I don't
guarantee it - but I haven't seen it off yet) -

1. Only one span can start from a term.
2. Start matching from the left and work right.

so applying to your example:

  SpanNearQuery
    SpanTermQuery term=a
    SpanTermQuery term=k


0:a 1:a 1:b 2:c 2:d 3:e 3:a 4:f 4:g 5:h 5:i 6:j 6:a 7:b 7:k 8:k
>
>  span 0 to 8
>  span 1 to 8
>  span 3 to 8
>  span 6 to 8

So first  we see 0 which is an 8 - we draw our span because the k at 7
is within 30: 0-8.
We move right now, because we can't start at that term again.
Another a - and again the
k at 7 is within 30 - mark our span 1-8. Now we have to move right one
at least, but we don't
find the next a till 3 - again there is a k within 30 at 7 - mark our
span: 3-8. Now move right a
term at least - we find another a at 6 - again there is a k within 30 at
7 - mark our span: 6-8.
Now we are done. We never needed or used the k at 8 (ends at 9) in the
Spans algorithm.
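
As a rough illustration of the two rules, here is a toy sketch in plain Java.
It is only a model of the walk-through above, not Lucene's actual Spans
implementation: the class and method names are hypothetical, and it makes the
simplifying assumption that the second term is only looked for to the right of
the starting term. The slop of 30 is the value used in the walk-through.

```java
import java.util.ArrayList;
import java.util.List;

class SpanRulesToy {
    // starts: sorted positions of the term a span may start from.
    // others: sorted positions of the other term.
    // Returns {start, end} pairs, with end exclusive (position + 1),
    // matching how the spans are written in this thread.
    static List<int[]> enumerate(int[] starts, int[] others, int slop) {
        List<int[]> spans = new ArrayList<>();
        for (int s : starts) {                  // rule 2: work left to right
            for (int o : others) {
                if (o > s && o - s <= slop) {   // first later occurrence within slop
                    spans.add(new int[] {s, o + 1});
                    break;                      // rule 1: one span per starting term
                }
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        int[] a = {0, 1, 3, 6}; // positions of "a" in the test doc
        int[] k = {7, 8};       // positions of "k" in the test doc
        for (int[] span : enumerate(a, k, 30)) {
            System.out.println("span " + span[0] + " to " + span[1]);
        }
    }
}
```

Under those assumptions the sketch prints the same four spans (0 to 8, 1 to 8,
3 to 8, 6 to 8), and the "k" at position 8 is never consulted.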

--
- Mark

http://www.lucidimagination.com





Re: SpanNearQuery's spans & payloads

Mark Miller-3
Sorry for the spam - typo of '8' instead of 'a' - hard enough to follow
without that - read this one below instead:

Mark Miller wrote:

> Mark Miller wrote:
>  
>>> Yeah I think you do, except each payload is only returned once.  So
>>> it's only the first span that hits a payload that will return it.
>>>
>>> So it sounds like SNQ just isn't guaranteed to be exhaustive in how it
>>> enumerates the spans, eg I'll never see that 2nd occurrence of "k",
>>> nor its associated payload.
>>>  
>>>    
>>>      
>> Not only not guaranteed, but its just not going to happen - its not
>> how spans match. If I say find n within 300 of m with the following:
>>
>> n m m m m m m m m m m m m  m m m m m m m m m m m m m m m m m m m m m m
>> m  m m m m m m m m m m m
>>
>> Only the first m will match. It will start at the left, find the n, then
>> say great, an m within 300, this doc matches, we are done. There is
>> not another n to start on or finish on to the right. It doesn't then
>> touch the next 300 m's - just they way Doug implemented them from what I
>> can tell. Its only exhaustive from the
>> left - find m within 300 of n, order matters (m first)
>>
>> m m m m m m m m m m m m m m m m m m n
>>
>> This will be a bunch of spans - start at the left - the first m to n
>> matches, then the second m - n matches, then the third m to n matches,
>> and so on as we move right.
>>  
>>    
> You can figure out what will match using the Span rules I mentioned by
> the way (at least
> I believe so).
>
> Those rules are simple (this is my current working knowledge and I don't
> guarantee it - but I havn't seen it off yet) -
>
> 1. Only one span can start from a term.
> 2. Start matching from the left and work right.
>
> so applying to your example:
>
>   SpanNearQuery
>     SpanTermQuery term=a
>     SpanTermQuery term=k
>
>
> 0:a 1:a 1:b 2:c 2:d 3:e 3:a 4:f 4:g 5:h 5:i 6:j 6:a 7:b 7:k 8:k
>  
>>  span 0 to 8
>>  span 1 to 8
>>  span 3 to 8
>>  span 6 to 8
>>    
>
> So first  we see 0 which is an a - we draw our span because the k at 7
> is within 30: 0-8.
> We move right now, because we can't start at that term again.
> Another a - and again the
> k at 7 is within 30 - mark our span 1-8. Now we have to move right one
> at least, but we don't
> find the next a till 3 - again there is a k within 30 at 7 - mark our
> span: 3-8. Now move right a
> term at least - we find another a at 6 - again there is a k within 30 at
> 7 - mark our span: 6-8.
> Now we are done. We never needed or used the k at 8 (ends at 9) in the
> Spans algorithm.
>
>  


--
- Mark

http://www.lucidimagination.com





Re: SpanNearQuery's spans & payloads

Michael McCandless-2
In reply to this post by Mark Miller-3
On Sat, Sep 12, 2009 at 8:40 AM, Mark Miller <[hidden email]> wrote:

>>> They start at the left and march right - each Span always starting
>>> after the last started,
>>
>> That's not quite always true -- eg I got span 1-8, twice, once I
>> added "b" as a clause to the SNQ.
>
> Mmm - right - depends on how you look at it I think - it is less
> simple with terms at multiple positions, in that now each Span
> doesn't start in the *position* after the last - but if you line up
> the terms like you did, its still the same - the first 1 - 8 starts
> at the first term at pos 1, and the next 1 to 8 starts at the
> seconds term at pos 1. One starts after the other (though if you
> think Lucene positions, I realize they virtually start at the same
> spot).

Ahh ok got it -- each underlying "start" of the span always advances
after that span is returned.

Thanks for all the explanations Mark!  I understand how it works now.

Mike


Re: SpanNearQuery's spans & payloads

Paul Elschot
In reply to this post by Mark Miller-3
On Saturday 12 September 2009 14:40:28 Mark Miller wrote:
> Michael McCandless wrote:
> > OK thanks for the responses. This is indeed tricky stuff!
> >
> > On Sat, Sep 12, 2009 at 12:28 AM, Mark Miller <[hidden email]> wrote:
> >
> >
> >> They start at the left and march right - each Span always starting
> >> after the last started,
> >>
> >
> > That's not quite always true -- eg I got span 1-8, twice, once I added
> > "b" as a clause to the SNQ.
> >
> Mmm - right - depends on how you look at it I think - it is less simple
> with terms at multiple positions, in that now each Span doesn't start
> in the *position* after the last - but if you line up the terms like you
> did, its still the same - the first 1 - 8 starts at the first term at
> pos 1, and
> the next 1 to 8 starts at the seconds term at pos 1. One starts after
> the other (though if you think Lucene positions, I realize they virtually
> start at the same spot).
> >
> >> You might want exhaustive for highlighting as well - but its
> >> different algorithms ...
> >>
> >
> > Yeah, how we would represent spans for highlighting is tricky... we
> > had discussed this ("how to represent spans for aggregate queries")
> > recently, I think under LUCENE-1522.
> >
> > I think we'd have to return a tree structure, that mirrors the query's
> > tree structure, to hold the spans, rather than try to enumerate
> > ("denormalize") all possible expansions. Each leaf node would hold
> > actual data (position, term, payload, etc.), and then the tree nodes
> > would express how they are and/ord/near'd together. My app could then
> > walk the tree to compute any combination I wanted.
> >
> >
> >> In the end, I accepted my definition of works as - when I ask for
> >> the payloads back, will I end up with a bag of all the payloads that
> >> the Spans touched. I think you do.
> >>
> >
> > Yeah I think you do, except each payload is only returned once. So
> > it's only the first span that hits a payload that will return it.
> >
> > So it sounds like SNQ just isn't guaranteed to be exhaustive in how it
> > enumerates the spans, eg I'll never see that 2nd occurrence of "k",
> > nor its associated payload.
> >
> Not only not guaranteed, but its just not going to happen - its not
> how spans match. If I say find n within 300 of m with the following:
>
> n m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m
> m m m m m m m m m m m m
>
> Only the first m will match. It will start at the left, find the n, then
> say great, an m within 300, this doc matches, we are done. There is
> not another n to start on or finish on to the right. It doesn't then
> touch the next 300 m's - just they way Doug implemented them from what I
> can tell. Its only exhaustive from the
> left - find m within 300 of n, order matters (m first)
>
> m m m m m m m m m m m m m m m m m m n
>
> This will be a bunch of spans - start at the left - the first m to n
> matches, then the second m - n matches, then the third m to n matches,
> and so on as we move right.


In the ordered case that last one should only match once, against
the last m.
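
One way to picture that, as a toy sketch in plain Java rather than the real
ordered-spans code: for an in-order query (m, then n), keep only the shortest
match that ends at the n. The class and method names are hypothetical, and the
slop check (number of positions strictly between the two terms) is an
assumption made for the illustration.

```java
class OrderedNearToy {
    // Returns {start, end} (end exclusive) of the shortest in-order match
    // that ends at secondPos, or null if no first-term position qualifies.
    static int[] shortestOrderedMatch(int[] firstPositions, int secondPos, int slop) {
        int best = -1;
        for (int p : firstPositions) {
            // p must come before secondPos, with at most `slop` positions between
            if (p < secondPos && secondPos - p - 1 <= slop) {
                best = Math.max(best, p); // the closest start gives the shortest span
            }
        }
        return best < 0 ? null : new int[] {best, secondPos + 1};
    }

    public static void main(String[] args) {
        int[] m = new int[18];           // eighteen m's at positions 0..17
        for (int i = 0; i < m.length; i++) {
            m[i] = i;
        }
        int n = 18;                      // a single n right after them
        int[] span = shortestOrderedMatch(m, n, 300);
        System.out.println("span " + span[0] + " to " + span[1]);
    }
}
```

In this model the eighteen m's followed by one n give a single span, from the
last m (position 17) to just past the n.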


Regards,
Paul Elschot


> > For now I'll just match this behavior ("can only load payload once")
> > in all codecs in LUCENE-1458... the test passes again once I do that.
> >
> >
> >> I meant, all those Spans came from one query - so you got your bag
> >> of payloads right? If each Span was a separate entity, it would
> >> obviously be way wrong - but from a single SpanQuery, at least you
> >> got all the payloads in some form :)
> >>
> >
> > Right, this is all one query... but the payload for the 2nd
> > occurrence of "k" was never included in any span so I didn't get "all"
> > payloads.
> >
> You got all the payloads the query matched - I think you need a
> different query (or
> we change the Spans algorithm completely)
> > Maybe if/once we incorporate spans into Lucene's normal queries
> > (optionally, so there's no performance hit if you don't ask for them)
> > we can re-visit these issues.
> >
> > Mike
> >


