[jira] Created: (LUCENE-1286) LargeDocHighlighter - another span highlighter optimized for large documents

Igor Motov (Jira)
LargeDocHighlighter - another span highlighter optimized for large documents
----------------------------------------------------------------------------

                 Key: LUCENE-1286
                 URL: https://issues.apache.org/jira/browse/LUCENE-1286
             Project: Lucene - Java
          Issue Type: Improvement
          Components: contrib/highlighter
    Affects Versions: 2.4
            Reporter: Mark Miller


The existing Highlighter API is rich and well designed, but the approach taken is not very efficient for large documents.

I believe that this is because the current Highlighter rebuilds the document by running through and scoring every token in the tokenstream.

With a break in the current API, an alternate approach can be taken: rebuild the document by running through the query terms by using their offsets. The benefit is clear - a large doc will have a large tokenstream, but a query will likely be very small in comparison.

I expect this approach to be quite a bit faster for very large documents, while still supporting Phrase and Span queries.

First rough patch to follow shortly.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

[jira] Updated: (LUCENE-1286) LargeDocHighlighter - another span highlighter optimized for large documents

Igor Motov (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated LUCENE-1286:
--------------------------------

    Priority: Minor  (was: Major)

[jira] Commented: (LUCENE-1286) LargeDocHighlighter - another span highlighter optimized for large documents

Igor Motov (Jira)
In reply to this post by Igor Motov (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646177#action_12646177 ]

Koji Sekiguchi commented on LUCENE-1286:
----------------------------------------

bq. First rough patch to follow shortly.

Mark,
I'm very interested in this. How is it going?

[jira] Commented: (LUCENE-1286) LargeDocHighlighter - another span highlighter optimized for large documents

Igor Motov (Jira)
In reply to this post by Igor Motov (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646266#action_12646266 ]

Mark Miller commented on LUCENE-1286:
-------------------------------------

It didn't turn out as well as I had hoped. The cost of building the MemoryIndex and getting spans from it was too high. I haven't closed the issue because I hope to keep trying, but I don't have anything great at the moment. If I have the time I'll get back into it though. Storing position/offset termvectors is the only helpful thing for large docs that I know of at the moment.

There is another highlighter by Ronnie something in JIRA that also takes this approach, but without Phrase/Span support and requiring stored termvectors. You might try it though.

[jira] Commented: (LUCENE-1286) LargeDocHighlighter - another span highlighter optimized for large documents

Igor Motov (Jira)
In reply to this post by Igor Motov (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651452#action_12651452 ]

Koji Sekiguchi commented on LUCENE-1286:
----------------------------------------

Thanks, Mark. I've tried Ronnie's patch in LUCENE-644. It was very fast, but phrase support is needed in our project.

So I'd like to know more about the approach mentioned in the description above. Can you elaborate on this - "rebuild the document by running through the query terms by using their offsets"?

[jira] Commented: (LUCENE-1286) LargeDocHighlighter - another span highlighter optimized for large documents

Igor Motov (Jira)
In reply to this post by Igor Motov (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653283#action_12653283 ]

Mark Miller commented on LUCENE-1286:
-------------------------------------

Hey Koji, I actually have some ideas to come back to this with, but I won't have time to actually work on it for a while.

bq. Can you elaborate on this - "rebuild the document by running through the query terms by using their offsets"?

Part of the problem with the Highlighter and large docs is that it runs through every token in the doc and scores that token, building the original highlighted doc as it goes. For a large doc, that can be a bit slow. What Ronnie's highlighter did was just look at the offsets of the query terms (hence the need for term vectors), which allows you to rebuild the original highlighted document in big quick chunks (stitching things together between query term offsets).
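
To make the stitching idea concrete, here is a minimal sketch in plain Java. It is not Ronnie's patch and it does not use any Lucene API: the class OffsetStitcher, the Hit holder, and the highlight method are hypothetical names made up for illustration, and the hits are assumed to have been read from stored term vector offsets beforehand. The work done is proportional to the number of query-term hits, not to the number of tokens in the document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class OffsetStitcher {

    /** Hypothetical holder for one query-term hit: [start, end) character offsets. */
    static final class Hit {
        final int start;
        final int end;

        Hit(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Rebuilds the highlighted text by copying the untouched chunk between
     * consecutive hits and wrapping each hit in the given pre/post tags.
     * Hits that overlap an earlier hit or fall outside the text are skipped.
     */
    static String highlight(String text, List<Hit> hits, String pre, String post) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingInt((Hit h) -> h.start));

        StringBuilder out = new StringBuilder(text.length());
        int cursor = 0; // end of the last region already copied
        for (Hit h : sorted) {
            if (h.start < cursor || h.end > text.length() || h.end <= h.start) {
                continue; // overlapping, out-of-range, or empty hit
            }
            out.append(text, cursor, h.start);               // big untouched chunk
            out.append(pre).append(text, h.start, h.end).append(post);
            cursor = h.end;
        }
        out.append(text, cursor, text.length());             // tail after the last hit
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "the quick brown fox jumps over the lazy dog";
        // Offsets of "fox" and "quick", deliberately given out of order.
        List<Hit> hits = Arrays.asList(new Hit(16, 19), new Hit(4, 9));
        System.out.println(highlight(text, hits, "<b>", "</b>"));
    }
}
```

The sketch only covers the "where do the tags go" step. Getting the offsets cheaply is the hard part, which is why this style of highlighter needs term vectors with offsets stored at index time.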

I was attempting a similar thing here with phrase and span support, but I couldn't match the speed of the current Span Highlighter - this is because the current Span Highlighter can highlight non-position-sensitive terms very fast. My method required getting non-position-sensitive terms from the MemoryIndex as well (via getSpans), and the cost ruined any benefit. I've come up with a few things to try since then but haven't had the time to dedicate to it yet. It's hard to get around requiring term vectors (for the offsets), and I'd like to avoid that. At the same time, if you don't require term vectors, it's probably going to be pretty slow re-analyzing the documents anyway...
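
As an illustration of why position-sensitive terms are the harder case, the following plain-Java sketch finds exact phrase matches from per-term position and offset lists, the kind of data a term vector stored with positions and offsets could supply. It is not the approach from the patch (no MemoryIndex, no getSpans) and it is not a Lucene API: PhraseOffsets, Occ, and phraseSpans are hypothetical names, and the in-memory map stands in for whatever actually supplies the term data. A plain term only needs its offsets, whereas a phrase term is highlighted only if its neighbours sit at the adjacent positions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

final class PhraseOffsets {

    /** Hypothetical term occurrence: token position plus [start, end) character offsets. */
    static final class Occ {
        final int position;
        final int start;
        final int end;

        Occ(int position, int start, int end) {
            this.position = position;
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Returns one {start, end} character span per exact occurrence of the phrase,
     * i.e. where phrase[i] occurs at position p + i for some starting position p.
     * Only the occurrences of the query terms are visited, never the whole document.
     */
    static List<int[]> phraseSpans(Map<String, List<Occ>> vector, String... phrase) {
        List<int[]> spans = new ArrayList<>();
        if (phrase.length == 0) {
            return spans;
        }
        List<Occ> heads = vector.getOrDefault(phrase[0], Collections.<Occ>emptyList());
        for (Occ head : heads) {
            Occ last = head;
            boolean matched = true;
            for (int i = 1; i < phrase.length && matched; i++) {
                Occ next = findAt(vector.get(phrase[i]), head.position + i);
                if (next == null) {
                    matched = false; // required neighbour is missing at this position
                } else {
                    last = next;
                }
            }
            if (matched) {
                spans.add(new int[] {head.start, last.end});
            }
        }
        return spans;
    }

    /** Linear scan for the occurrence at the given position, or null if there is none. */
    private static Occ findAt(List<Occ> occs, int position) {
        if (occs == null) {
            return null;
        }
        for (Occ o : occs) {
            if (o.position == position) {
                return o;
            }
        }
        return null;
    }
}
```

The spans returned here could then be fed to an offset-based stitching step. Span queries with slop or nesting would need more than this adjacency check, so treat it as a sketch of the simplest phrase case only.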

[jira] Closed: (LUCENE-1286) LargeDocHighlighter - another span highlighter optimized for large documents

Igor Motov (Jira)
In reply to this post by Igor Motov (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller closed LUCENE-1286.
-------------------------------

    Resolution: Fixed

This isn't likely to go anywhere anytime soon - Koji's FastVectorHighlighter, while requiring termvectors, accomplishes this pretty nicely.
