how to match Documents from Hits with Documents from Query Spans?


Boris Galitsky-2
Hello,

I am using span queries to get hits (Documents) and the occurrences
(positions) of search terms within these documents.

For some reason, the order in which the Documents are returned in hits
disagrees with the order in which they are referenced (via an order
number, starting from 0) in the Spans.

 

The problem is depicted in the diagram below:

 

 

Query => Lucene => hits -> Documents
  |                            |
Spans -> doc(), start(), end() |
           \-------????--------/



 

Lucene takes a Query and returns hits with the resulting Documents, while
the occurrences of the search expression are obtained from the Query. Why
is the logic so odd? Again, how do I match Documents from Hits
with Documents from Query Spans?

 

Regards

Boris

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: how to match Documents from Hits with Documents from Query Spans?

Chris Hostetter-3

: For some reason, there is a disagreement between the order the
: Documents are returned in hits, and the Documents are referenced (via
: order number, starting from 0) in the Spans?

When dealing with a Hits instance, documents are iterated over in "results
order" -- which may be by score, or may be by some other sort you've
specified.

When dealing with a Spans instance, I believe the matches are iterated
over in index order.  Besides the performance reasons why this may
be true, you also have to keep in mind that the Spans instance has no
idea what ordering you may have used when you executed your search -- even
if it assumed you sorted by score, the SpanQuery may have been a part of a
much larger, more complicated query in which the final scores were vastly
different.

If I've misunderstood your problem, could you please post a JUnit test case
that builds a small index in a RAMDirectory, with some code that
demonstrates what you expect to happen, and how it fails?



-Hoss
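
The ordering difference described above can be illustrated with a small plain-Java sketch. It does not use Lucene at all: ScoredDoc, resultsOrder and indexOrder are hypothetical names, and the scores are made up. It only shows why the n-th hit and the n-th document seen by a Spans enumeration need not be the same document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java stand-in for the ordering mismatch; no Lucene classes are used.
final class OrderingDemo {

    // Hypothetical stand-in for one matching document: internal ID plus score.
    static final class ScoredDoc {
        final int docId;
        final float score;

        ScoredDoc(int docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    // Doc IDs in "results order" (highest score first), as a score-sorted Hits would list them.
    static List<Integer> resultsOrder(List<ScoredDoc> matches) {
        List<ScoredDoc> sorted = new ArrayList<>(matches);
        sorted.sort(Comparator.comparingDouble((ScoredDoc d) -> d.score).reversed());
        return ids(sorted);
    }

    // Doc IDs in index order (ascending ID), the order Spans is believed to use.
    static List<Integer> indexOrder(List<ScoredDoc> matches) {
        List<ScoredDoc> sorted = new ArrayList<>(matches);
        sorted.sort(Comparator.comparingInt((ScoredDoc d) -> d.docId));
        return ids(sorted);
    }

    // Extracts the doc IDs, preserving list order.
    private static List<Integer> ids(List<ScoredDoc> docs) {
        List<Integer> out = new ArrayList<>();
        for (ScoredDoc d : docs) {
            out.add(d.docId);
        }
        return out;
    }

    public static void main(String[] args) {
        List<ScoredDoc> matches = Arrays.asList(
                new ScoredDoc(0, 0.2f), new ScoredDoc(1, 0.9f), new ScoredDoc(2, 0.5f));
        System.out.println("results order: " + resultsOrder(matches));
        System.out.println("index order:   " + indexOrder(matches));
    }
}
```

Pairing the two lists by position would attach span positions to the wrong documents whenever the two orders differ, which is why the pairing has to go through the document ID instead.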



How to get Document (or filename) from Span

Boris Galitsky-2
Thanks a lot, Hoss.

The issue is that when I get Spans, I get start/end positions and a
Document order number (starting from 0), not the Document object itself,
from which I could get a filename. Since I believe there is no way to get a
Document object from Spans, and there is no such thing as a Document ID
in Lucene (right?), I am trying to keep the same order for
Hits and for Spans (the indexing order) and retrieve the Document for each
Spans match this way.

I will try to prepare a test case. It works so far, but I am afraid it
will be unstable.

Best regards
Boris



On Tue, 18 Apr 2006 10:29:30 -0700 (PDT)
  Chris Hostetter <[hidden email]> wrote:

> [snip]



Re: How to get Document (or filename) from Span

Chris Hostetter-3

: The question is when I get Spans, I get start/end positions and a
: Document order (starting from 0), not the Document object itself from

Are you sure about that?  Spans.doc() should return the internal
document identifier, which you can pass to IndexReader.document(int).

: which I could get a filename. Since I believe there is no way to get a
: Document object from Spans, and there is no such thing as Document ID
: in Lucene (right?) I attempt to have the same order for
: Hits and for Spans (the indexing order) and retrieve Document for each
: Spans this way.

Documents do have Document IDs, assigned based on index order.  That's
what Hits.id() returns.

FYI: take a look at the TestSpans.testSpanNearOrderedOverlap test for an
example of how the Spans class works. (It's what I'm using as a basis for
my suggestion as to how to use the class -- I've never used it myself.)




-Hoss
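
A minimal sketch of the join suggested above, written in plain Java so it runs without Lucene on the classpath. SpanMatch, byDoc and describe are hypothetical names. The array hitDocIds stands in for the values Hits.id(i) would return, and fileNameByDocId stands in for reading a stored field from IndexReader.document(docId); the existence of a stored filename field is an assumption about the index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of joining hits and span matches on the document ID.
final class SpanHitJoin {

    // Hypothetical stand-in for one Spans match: the values of doc(), start(), end().
    static final class SpanMatch {
        final int doc;
        final int start;
        final int end;

        SpanMatch(int doc, int start, int end) {
            this.doc = doc;
            this.start = start;
            this.end = end;
        }
    }

    // Groups span matches by document ID, keeping their original order per document.
    static Map<Integer, List<SpanMatch>> byDoc(List<SpanMatch> spans) {
        Map<Integer, List<SpanMatch>> grouped = new HashMap<>();
        for (SpanMatch m : spans) {
            grouped.computeIfAbsent(m.doc, k -> new ArrayList<>()).add(m);
        }
        return grouped;
    }

    // Walks the hits in results order and attaches each hit's spans via its doc ID,
    // never via its position in the hit list.
    static List<String> describe(int[] hitDocIds, String[] fileNameByDocId, List<SpanMatch> spans) {
        Map<Integer, List<SpanMatch>> grouped = byDoc(spans);
        List<String> lines = new ArrayList<>();
        for (int docId : hitDocIds) {
            StringBuilder sb = new StringBuilder(fileNameByDocId[docId]);
            for (SpanMatch m : grouped.getOrDefault(docId, Collections.emptyList())) {
                sb.append(" [").append(m.start).append(',').append(m.end).append(')');
            }
            lines.add(sb.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        List<SpanMatch> spans = Arrays.asList(
                new SpanMatch(0, 1, 3), new SpanMatch(2, 0, 2), new SpanMatch(2, 7, 9));
        int[] hitDocIds = {2, 0}; // results order, e.g. by score
        String[] fileNames = {"a.txt", "b.txt", "c.txt"};
        for (String line : describe(hitDocIds, fileNames, spans)) {
            System.out.println(line);
        }
    }
}
```

If only the filename for each span is needed, the grouping step can be skipped: looking the document up directly by the ID that the span reports is enough, and the hit list is not involved at all.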



Re: How to get Document (or filename) from Span

Grant Ingersoll
In reply to this post by Boris Galitsky-2
The doc() number can be given to IndexReader.document() to get the
Document, I believe.


--

Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
School of Information Studies
335 Hinds Hall
Syracuse, NY 13244

http://www.cnlp.org 
Voice:  315-443-5484
Fax: 315-443-6886



Re: How to get Document (or filename) from Span

Boris Galitsky-2
In reply to this post by Chris Hostetter-3
I fully understand now. Thanks a lot.
Boris

On Tue, 18 Apr 2006 11:10:20 -0700 (PDT)
  Chris Hostetter <[hidden email]> wrote:

> [snip]

