fastest way to gather simple terms that match documents?

fastest way to gather simple terms that match documents?

Jason Eacott-2
Hi all,
    After I've run a query, I need to know which terms matched each
result document (i.e. which terms have a frequency > 0 in that document).
The only way I know to do this is by calling explain() on each document,
which the documentation describes as almost the equivalent of running a
new query for each call, so I'm keen to avoid that option if possible.
Is there a quick way to discover this information? All I need is a
list of terms (simple strings would be fine).
I don't care how many were found, at what position, or anything else,
just which ones matched.

thoughts?

Thanks
Jason.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: fastest way to gather simple terms that match documents?

Chris Hostetter-3

:     After I've run a query I need to know which terms matched each
: result document (ie doc termfrequency>0).
        ...
: I don't care how many were found or what position or anything else.
: just which ones matched.

if all you care about is simple "which terms does it have" you can take
your list of terms and your list of docids, sort both lists, and then use
termDocs to loop over the terms and over the docs.  (the sorting is key
for performance, because it allows you to always skip forward, w/o
needing to restart the termDocs)

something like...

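TermDocs iter = indexReader.termDocs();
try {
  for (Term t : myTerms) {        // myTerms: sorted list of terms
    iter.seek(t);
    boolean more = iter.next();   // position on the first doc for this term
    for (int docid : myDocs) {    // myDocs: docids sorted ascending
      // skipTo() always advances at least one doc, so only call it when behind
      if (more && iter.doc() < docid) more = iter.skipTo(docid);
      if (!more) break;           // no more docs for this term
      if (iter.doc() == docid) doSomethingWith(t, docid);
    }
  }
} finally {
  iter.close();
}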
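
To see why the sorting matters without needing an index at hand, here is a minimal stand-alone sketch of the same forward-only walk. It is not Lucene code: the int[] arrays are hypothetical stand-ins for one term's postings and for the sorted docids of the hits, and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-alone illustration; the int[] arrays stand in for Lucene postings.
class SortedMatchSketch {

    /**
     * Returns the doc ids from sortedHits that also occur in sortedPostings.
     * Both arrays must be sorted ascending, so the cursor only moves forward.
     */
    static List<Integer> matchingDocs(int[] sortedPostings, int[] sortedHits) {
        List<Integer> matches = new ArrayList<>();
        int p = 0; // cursor into the postings, never reset
        for (int docId : sortedHits) {
            // Advance only while the cursor is behind the wanted doc id.
            while (p < sortedPostings.length && sortedPostings[p] < docId) {
                p++;
            }
            if (p == sortedPostings.length) {
                break; // postings exhausted, no later hit can match
            }
            if (sortedPostings[p] == docId) {
                matches.add(docId);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        int[] postings = {2, 5, 7, 11}; // docs containing some term
        int[] hits = {3, 5, 11, 20};    // doc ids returned by the query
        System.out.println(matchingDocs(postings, hits)); // prints [5, 11]
    }
}
```

Each postings entry is visited at most once per term, so the cost per term is roughly the length of its postings plus the number of hits, rather than their product.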



-Hoss



RE: fastest way to gather simple terms that match documents?

Uwe Schindler
Alternatively index your documents with term vectors for the field enabled:

http://lucene.apache.org/java/3_0_1/api/all/org/apache/lucene/document/Field.TermVector.html

And then use IndexReader.getTermFreqVector() with the matching doc ID:

http://lucene.apache.org/java/3_0_1/api/all/org/apache/lucene/index/IndexReader.html#getTermFreqVector(int, java.lang.String)
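
Once a document's terms are in hand, the remaining step is a set intersection with the query terms. The sketch below shows only that step and runs without Lucene: docTerms is a stand-in for the String[] that the term vector's getTerms() would return for one document and field, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Stand-alone illustration; docTerms stands in for a term vector's terms.
class TermVectorMatchSketch {

    /** Returns the query terms that occur in the document's term list. */
    static Set<String> matchedTerms(String[] docTerms, Collection<String> queryTerms) {
        Set<String> present = new HashSet<>(Arrays.asList(docTerms));
        Set<String> matched = new TreeSet<>(); // sorted for stable output
        for (String q : queryTerms) {
            if (present.contains(q)) {
                matched.add(q);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        String[] docTerms = {"apache", "index", "lucene", "search"};
        System.out.println(matchedTerms(docTerms, Arrays.asList("lucene", "solr")));
        // prints [lucene]
    }
}
```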

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [hidden email]



RE: fastest way to gather simple terms that match documents?

Chris Hostetter-3

: Alternatively index your documents with term vectors for the field enabled:
        ...
: And then use IndexReader.getTermFreqVector() with the matching doc ID:

Uwe: this is an area i'm not particularly strong on, so i'm curious: do
you expect that the TermFreqVector approach would be faster than the
TermDocs approach for the type of use case where docs tend to be "large"
but the list of specific terms you are interested in testing for is
"small" (ie: just the terms used in the original query)?

I ask because off the top of my head i'm not seeing how it
would really give you much of a time savings in return. instead of
seeking over the handful of terms you care about, the TermVectorMapper
will have to scan over every Term in each of the documents.  (writing your
own TermVectorMapper that ignores the terms you don't care about will
help, but that still doesn't sound any faster)
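
To make the concern concrete, here is a rough sketch of a filtering callback. It is a simplified, hypothetical stand-in and not Lucene's TermVectorMapper (whose map method takes more arguments); the point is only that the callback fires once per term stored for the document, however few terms are wanted.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical, simplified stand-in for a term-vector callback.
class FilteringMapperSketch {
    private final Set<String> wanted;
    private final Set<String> matched = new TreeSet<>();
    private int callbacks = 0;

    FilteringMapperSketch(Set<String> wanted) {
        this.wanted = wanted;
    }

    // Invoked once per term stored for the document, wanted or not.
    void map(String term, int frequency) {
        callbacks++;
        if (frequency > 0 && wanted.contains(term)) {
            matched.add(term);
        }
    }

    Set<String> matched() { return matched; }

    int callbacks() { return callbacks; }

    public static void main(String[] args) {
        Set<String> queryTerms = new HashSet<>();
        queryTerms.add("lucene");
        queryTerms.add("solr");
        FilteringMapperSketch mapper = new FilteringMapperSketch(queryTerms);
        String[] docTerms = {"apache", "index", "lucene", "search", "vector"};
        for (String term : docTerms) {
            mapper.map(term, 1);
        }
        System.out.println(mapper.matched() + " after " + mapper.callbacks() + " callbacks");
        // prints [lucene] after 5 callbacks
    }
}
```

The filter keeps the result small, but the number of callbacks still grows with the document's vocabulary rather than with the size of the query.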




-Hoss

