Getting tokens from search results. Simple concept


Getting tokens from search results. Simple concept

HPDrifter
When I get a search result based on my index, I need the exact tokens which were identified in the index as part of the result.  Why?  I need the character offsets.

I have a solution right now... almost. But it bugs the hell out of me that I can't just say something like:
documentHit[0].getIdentifiedTokens();

Do I need to make a contribution in order to make this happen?


Re: Getting tokens from search results. Simple concept

Erik Hatcher
Have you looked at the contrib Highlighter? Or using an Analyzer directly to give you the offsets?

        Erik



Re: Getting tokens from search results. Simple concept

HPDrifter
Yes, I have, but it is too memory intensive. I used the Highlighter as my first attempt, but it wasn't a good solution because I have to send the entire text to it.

What I did instead is similar to your suggestion (a rough sketch follows below):
1. Use the analyzer to give me a token stream.
2. Search the token stream for the keyword I'm looking for (the keyword needs to be analyzed as well!).
3. Extract the token's offsets.
4. Use the offsets and Java's RandomAccessFile to "seek" to the byte (character) position, then extract a "fragment" of about 500 chars around that position.

This solution requires very little memory and, I hope, will hold up as expected under sustained load.
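
Roughly, in code, the four steps above look something like this (Lucene 2.4-era TokenStream API; the analyzer, field names, and file handling are illustrative assumptions, and the RandomAccessFile seek only lines up exactly with character offsets for single-byte encodings):

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class FragmentExtractor {

    /** Finds the first occurrence of the (analyzed) keyword in the raw text and
     *  returns roughly 500 chars around it, read back via RandomAccessFile. */
    public static String fragmentAround(File textFile, String rawText, String rawKeyword)
            throws IOException {
        Analyzer analyzer = new StandardAnalyzer();

        // Step 2 (prep): run the keyword through the same analyzer so it matches
        // the token form produced for the document text.
        String keyword = firstTerm(analyzer, rawKeyword);
        if (keyword == null) {
            return null;
        }

        // Steps 1 and 2: walk the document's token stream looking for the keyword.
        TokenStream stream = analyzer.tokenStream("contents", new StringReader(rawText));
        for (Token tok = stream.next(); tok != null; tok = stream.next()) {
            if (keyword.equals(tok.term())) {
                // Step 3: the token carries character offsets into the original text.
                int start = tok.startOffset();

                // Step 4: seek to the offset and pull ~500 chars around it.
                RandomAccessFile raf = new RandomAccessFile(textFile, "r");
                try {
                    long from = Math.max(0, start - 250);
                    raf.seek(from);
                    byte[] buf = new byte[500];
                    int read = raf.read(buf);
                    return read <= 0 ? "" : new String(buf, 0, read);
                } finally {
                    raf.close();
                }
            }
        }
        return null;  // keyword not present in this document's text
    }

    private static String firstTerm(Analyzer analyzer, String raw) throws IOException {
        TokenStream stream = analyzer.tokenStream("keyword", new StringReader(raw));
        Token tok = stream.next();
        return tok == null ? null : tok.term();
    }
}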

How does this sound to you?

What I would LOVE is if I could do it in a standard Lucene search like I mentioned earlier.
Hit.doc[0].getHitTokenList()
Something like this...

~Dustin




Re: Getting tokens from search results. Simple concept

hossman

: What I would LOVE is if I could do it in a standard Lucene search like I mentioned earlier.
: Hit.doc[0].getHitTokenList()
: Something like this...

The Query/Scorer APIs don't provide any mechanism for information like that to be conveyed back up the call chain -- mainly because it's more heavyweight than most people need.

If you have custom Query/Scorer implementations, you can keep track of whatever state you want when executing a Query -- in fact, the SpanQuery family of queries does keep track of exactly the type of info you seem to want, and after executing a query you can ask it for the "Spans" of any matching document. The downside is a loss in query-execution performance (because it takes time/memory to keep track of all the matches).
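
A minimal sketch of walking those Spans with a SpanTermQuery (Lucene 2.4-era API; the index path, field name, and term below are placeholders):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.Spans;
import org.apache.lucene.search.spans.SpanTermQuery;

public class SpansDemo {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/path/to/index");  // placeholder path
        SpanTermQuery query = new SpanTermQuery(new Term("contents", "lucene"));

        // getSpans() enumerates every match: the doc id plus start/end token
        // positions (not character offsets).
        Spans spans = query.getSpans(reader);
        while (spans.next()) {
            System.out.println("doc=" + spans.doc()
                + " start=" + spans.start() + " end=" + spans.end());
        }
        reader.close();
    }
}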


-Hoss




Re: Getting tokens from search results. Simple concept

HPDrifter
Thanks for your thoughts!  This, at least, makes me feel sane.



Re: Getting tokens from search results. Simple concept

Mike Klaas
Even with SpanQuery, if I'm not mistaken, spans track token _positions_, not _offsets_ in the original string.

A reverse text index like Lucene is fast precisely because it doesn't have to keep track of this information. I think the best alternative might be to use term vectors, which are essentially a cache of the analyzed tokens for a document.

-Mike



Re: Getting tokens from search results. Simple concept

Michael McCandless

Mike Klaas wrote:

> Even with SpanQuery, if I'm not mistaken, spans track token _positions_, not _offsets_ in the original string.

That's correct.

> A reverse text index like Lucene is fast precisely because it doesn't have to keep track of this information.

One option is to stuff the offsets into payloads, and then make a custom Query that decodes the offsets from the payload and stores them away when collecting hits.

> I think the best alternative might be to use term vectors, which are essentially a cache of the analyzed tokens for a document.

Another way to think of term vectors is as a single-document inverted index that you can retrieve in its entirety. I.e., it maps terms to their occurrences (count, positions, offsets) within the document.

I agree, term vectors should work for this.
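
A sketch of reading those offsets back out of term vectors (this assumes the field was indexed with Field.TermVector.WITH_POSITIONS_OFFSETS; the index path, doc id, and field name are placeholders):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermPositionVector;
import org.apache.lucene.index.TermVectorOffsetInfo;

public class TermVectorOffsets {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/path/to/index");  // placeholder path
        int docId = 0;                                            // a hit's doc id

        // Only works if offsets were stored in the field's term vector.
        TermPositionVector tpv =
            (TermPositionVector) reader.getTermFreqVector(docId, "contents");

        String[] terms = tpv.getTerms();
        for (int i = 0; i < terms.length; i++) {
            TermVectorOffsetInfo[] offsets = tpv.getOffsets(i);
            if (offsets == null) continue;   // offsets weren't stored for this term
            for (int j = 0; j < offsets.length; j++) {
                System.out.println(terms[i] + " -> ["
                    + offsets[j].getStartOffset() + "," + offsets[j].getEndOffset() + ")");
            }
        }
        reader.close();
    }
}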

I don't really understand, though, why the highlighter package doesn't work here -- it also just re-analyzes the text when it can't find term vectors.

Mike



Re: Getting tokens from search results. Simple concept

Grant Ingersoll
In contrib/analyzers there is a payload TokenFilter (TokenOffsetPayloadTokenFilter) that adds the offsets to each token as a Payload. Then, at query time, you can use SpanQuery.getPayloadSpans() to retrieve and decode the payload information.
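
Roughly, the index-time half of that looks like the wrapper below. The field name, analyzer, and the assumption that the filter packs the start and end offsets as two 4-byte ints are illustrative (check the filter's source for the exact encoding), and the read-back side is shown with the low-level TermPositions API rather than getPayloadSpans():

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.analysis.payloads.TokenOffsetPayloadTokenFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermPositions;

public class OffsetPayloadExample {

    /** Wraps another analyzer so each token carries its character offsets as a payload. */
    public static class OffsetPayloadAnalyzer extends Analyzer {
        private final Analyzer delegate = new StandardAnalyzer();

        public TokenStream tokenStream(String fieldName, Reader reader) {
            return new TokenOffsetPayloadTokenFilter(delegate.tokenStream(fieldName, reader));
        }
    }

    /** Dumps the stored offsets for one term via the low-level TermPositions API. */
    public static void dumpOffsets(IndexReader reader, String field, String text)
            throws Exception {
        TermPositions tp = reader.termPositions(new Term(field, text));
        while (tp.next()) {
            for (int i = 0; i < tp.freq(); i++) {
                tp.nextPosition();
                if (!tp.isPayloadAvailable()) continue;
                byte[] payload = tp.getPayload(new byte[tp.getPayloadLength()], 0);
                // Assumption: start and end offsets packed as two 4-byte ints.
                int start = PayloadHelper.decodeInt(payload, 0);
                int end = PayloadHelper.decodeInt(payload, 4);
                System.out.println("doc=" + tp.doc() + " offsets=[" + start + "," + end + ")");
            }
        }
        tp.close();
    }
}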

HTH,
Grant



Re: Getting tokens from search results. Simple concept

HPDrifter
This may be what I'm looking for. I'll follow up ASAP.
