Counting hits in a document

Counting hits in a document

Erick Erickson
Hi again.

I've been struggling for the last couple of days and getting nowhere, so
it's time to swallow my pride and say "Help"....

OK, let's say I have a document indexed and I do NOT have access to the raw
text. I need to find the offset of all the hits for a query on a single
document. Advice was offered a while ago to use getSpans from a spanquery,
but for the life of me I don't see how to make this work. As I remember,
Erik was talking about rewriting the original query as a set of spans.

The trouble I'm having is that I sure don't see how to rewrite the standard
query as a span query, then feed that back into my index for a particular
document (that I have a unique ID for). It seems that the getSpans looks
through my entire index, which is *probably* prohibitive.

I can make each part of the query into a SpanTermQuery. I can assemble these
together into a bunch of nested span queries. At the end of this, I have a
single Span query that I can call getSpans on. But what now? I don't see how
the spans relate to the document I need to focus on. From what I see of the
Spans interface, it's intended to look at the entire index rather than be
confined to a subset of the documents (in this case, exactly one.
Guaranteed).

I've thought about putting the documentID in a MUST clause of a
BooleanQuery, and adding my span query to that, but it doesn't look like
getSpans does me any good there.

I looked at the SrndQuery family and don't see anything there that lets me
get the offsets of my matches.

I don't have the text, so I can't highlight all the hits and count.

The code I've been writing feels like the wrong solution to the wrong
problem at the wrong time. Given that I know the document ID on the way in,
is my best bet to roll my own? That is, enumerate the relevant terms in my
document and measure the distance between the terms and aggregate the
results myself? I'd rather not do that, of course, but can if necessary.

I *want* someone to say "just call <fill in magic method here>"....

Any help greatly appreciated...

Thanks
Erick

Re: Counting hits in a document

Chris Hostetter-3

The Spans interface has a skipTo for jumping to a specific documentId (or
the first matching document with a higher documentId).
Once you've done that, the doc(), start(), and end() calls will tell
you about the match (which doc it's in, where that match starts, and
where it ends) ... use next() to advance to the next match -- if doc()
returns the same number as before, then you've got two matches in one
document; if doc() returns a new number, you have finished with the matches
in that document and moved on to the next matching document.

the key to all of this is that if you want to find docs matching "+FOO
+BAR" and then find where exactly the instances of FOO and BAR are, you can
do one query to get the matching docIds in whatever order you want, then
use separate SpanTermQueries to find where exactly each term is.
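
A rough sketch of that per-term idea, for illustration only. It does not
use Lucene itself: SpansCursor and ArraySpans are hypothetical stand-ins
that mimic the skipTo/next/doc/start/end calls described above over an
in-memory list, and PerTermHits is an invented helper name. With a real
index you would presumably obtain one Spans per SpanTermQuery instead.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the cursor methods of Lucene's Spans. */
interface SpansCursor {
    boolean next();              // advance to the next match; false when exhausted
    boolean skipTo(int target);  // advance to the first later match with doc() >= target
    int doc();
    int start();
    int end();
}

/** In-memory cursor over {doc, start, end} triples sorted by doc, then start. */
final class ArraySpans implements SpansCursor {
    private final int[][] matches;
    private int pos = -1;

    ArraySpans(int[][] matches) { this.matches = matches; }

    public boolean next() {
        if (pos < matches.length) pos++;
        return pos < matches.length;
    }

    public boolean skipTo(int target) {
        do {
            if (!next()) return false;
        } while (target > doc());
        return true;
    }

    public int doc()   { return matches[pos][0]; }
    public int start() { return matches[pos][1]; }
    public int end()   { return matches[pos][2]; }
}

final class PerTermHits {
    /** Returns the {start, end} pair of every match the cursor reports for docId. */
    static List<int[]> positionsInDoc(SpansCursor spans, int docId) {
        List<int[]> hits = new ArrayList<>();
        if (!spans.skipTo(docId)) return hits;     // nothing at or after docId
        while (spans.doc() == docId) {
            hits.add(new int[] { spans.start(), spans.end() });
            if (!spans.next()) break;              // ran out of matches
        }
        return hits;
    }

    /** One cursor per term, each scanned independently for the same document. */
    static Map<String, Integer> countPerTerm(Map<String, SpansCursor> spansByTerm, int docId) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, SpansCursor> e : spansByTerm.entrySet()) {
            counts.put(e.getKey(), positionsInDoc(e.getValue(), docId).size());
        }
        return counts;
    }
}
```

Each term gets its own cursor, so the position of one term's matches never
disturbs the scan for another.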


: Date: Thu, 18 Jan 2007 16:06:57 -0500
: From: Erick Erickson <[hidden email]>
: Subject: Counting hits in a document

-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Counting hits in a document

Mark Miller-3
In reply to this post by Erick Erickson
Just threw together a highlighter that can handle spans (combining a
rewrite with dumpSpans from LIA) and used this:
http://issues.apache.org/bugzilla/attachment.cgi?id=15568

Nice spans extractor from Mark (not me <G>). Give it a query it will
give you the spans.

- Mark


Re: Counting hits in a document

Erick Erickson
Hoss:

It was late this afternoon and I was square-eyed, so I didn't add the
detail. The app we're working on first returns a summary list of all the
books that match a query, no hit information. Next, the user clicks on a
returned title and we show the hits by chapter. That is, a list of chapters
and the count of the hits for each. The index is nearing 15G at present, so
I *assumed* that I really didn't want to re-query the entire index when I
know the particular document I care about already. But what do I know?

Mark:

Very most excellent. I'll give it a look in the morning. I hope that the
class doesn't need the raw text since I don't have it any more, but your
comment "Give it a query it will give you the spans" makes me hopeful.



The real issue is that it looks like I'm reverting to my old "C" days. The
code I was writing the last couple of days started to look like a program
from...well...a long time ago. So I *know* it must be wrong <G>...... It's a
real pain in the neck to *think* in Java terms when much of my training was
before this new-fangled way of looking at programming problems happened. I
suppose I could go into management, but that would be giving in to the dark
side....

Thanks all
Erick



Re: Counting hits in a document

Mark Miller-3

>
> Mark:
>
> Very most excellent. I'll give it a look in the morning. I hope that the
> class doesn't need the raw text since I don't have it any more, but your
> comment "Give it a query it will give you the spans" makes me hopeful.
>
Should have been more specific: Just give it a query and an appropriate
IndexReader <g>. No source text needed. Hope it works for you. Been a
real boon for me.

- Mark


Re: Counting hits in a document

Chris Hostetter-3
In reply to this post by Erick Erickson

: It was late this afternoon and I was square-eyed, so I didn't add the
: detail. The app we're working on first returns a summary list of all the
: books that match a query, no hit information. Next, the user clicks on a
: returned title and we show the hits by chapter. That is, a list of chapters
: and the count of the hits for each. The index is nearing 15G at present, so
: I *assumed* that I really didn't want to re-query the entire index when I
: know the particular document I care about already. But what do I know?

i never said anything about requerying the whole index, i said skipTo the
docid you care about...

on the second user click, figure out the docid (do a TermQuery or
an indexReader.termDocs on a Term containing whatever unique id you have
for each title) then do something like this using whatever SpanQuery you
want (it doesn't have to be your original SpanQuery, it could be a
SpanTermQuery that was part of your larger SpanQuery) ...

    SpanQuery whatever = ...
    Spans s = whatever.getSpans(indexReader);
    s.skipTo(yourDocId);
    while (s.doc() == yourDocId) {
      print("match between " + s.start() + " and " + s.end());
      s.next();
    }


-Hoss


Re: Counting hits in a document

Paul Elschot
Adding a few details:

On Friday 19 January 2007 06:42, Chris Hostetter wrote:
>
>
>     SpanQuery whatever = ...
>     Spans s = whatever.getSpans(indexReader);
      if (!s.skipTo(yourDocId)) {
        ... // no match
      } else {
>       while (s.doc() == yourDocId) {
>         print("match between " + s.start() + " and " + s.end());
          if (!s.next()) break;
>       }
      }

For performance, make sure not to go to the same disk the index is on
while using the spans like this.

In case you have multiple docs to treat, skip to them in increasing docId
order using the same spans.
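
A minimal sketch of that multi-document pass, under stated assumptions.
MatchCursor and SortedMatches are hypothetical stand-ins for Spans over an
in-memory list of doc ids, and MultiDocHitCounter is an invented name. The
stand-in skipTo always moves forward by at least one match before checking
the target, which is how I read the Spans javadoc; if the real
implementation differs, the guard before skipTo may be unnecessary.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for the cursor methods of Lucene's Spans. */
interface MatchCursor {
    boolean next();
    boolean skipTo(int target);
    int doc();
}

final class SortedMatches implements MatchCursor {
    private final int[] docs;   // doc id of each match, ascending
    private int pos = -1;

    SortedMatches(int... docs) { this.docs = docs; }

    public boolean next() {
        if (pos < docs.length) pos++;
        return pos < docs.length;
    }

    // Always advances at least once, then keeps going until doc() >= target.
    public boolean skipTo(int target) {
        do {
            if (!next()) return false;
        } while (target > doc());
        return true;
    }

    public int doc() { return docs[pos]; }
}

final class MultiDocHitCounter {
    /** Counts matches per requested doc in one forward pass over a single cursor. */
    static Map<Integer, Integer> countHits(MatchCursor spans, int[] docIds) {
        int[] sorted = docIds.clone();
        Arrays.sort(sorted);                       // visit docs in increasing order
        Map<Integer, Integer> counts = new LinkedHashMap<>();
        for (int id : sorted) counts.put(id, 0);

        boolean positioned = false;                // true once the cursor sits on a match
        boolean exhausted = false;
        for (int id : sorted) {
            if (exhausted) break;
            // Skip only when the cursor is not already at or beyond this doc;
            // otherwise skipTo would step past the first match of the doc.
            if (!positioned || spans.doc() < id) {
                if (!spans.skipTo(id)) break;      // no matches left at all
                positioned = true;
            }
            while (spans.doc() == id) {
                counts.merge(id, 1, Integer::sum);
                if (!spans.next()) { exhausted = true; break; }
            }
        }
        return counts;
    }
}
```

Finishing one document leaves the cursor on the first match of a later
document, so the sketch checks doc() before deciding whether to skip again.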

And if you ever want to write a Scorer, just add more details...

Regards,
Paul Elschot
