[jira] Created: (LUCENE-663) New feature rich higlighter for Lucene.

[jira] Created: (LUCENE-663) New feature rich higlighter for Lucene.

JIRA jira@apache.org
New feature rich higlighter for Lucene.
---------------------------------------

                 Key: LUCENE-663
                 URL: http://issues.apache.org/jira/browse/LUCENE-663
             Project: Lucene - Java
          Issue Type: New Feature
          Components: Search
            Reporter: Karel Tejnora
         Attachments: lucene-hlt-src.jar

Well, I refactored (borrowed) some code from two previous highlighters.
This highlighter:
+ uses TermPositionVector where available
+ uses an Analyzer if no TermPositionVector is found, or if it is forced to use one
+ supports all Lucene queries (Term, Phrase with slop, Prefix, Wildcard, Range) except Fuzzy Query (which can be implemented easily)

- has no support for scoring (yet)
- uses the same prefix and postfix for all accepted terms (yet)

? It is written in Java 5

In the next release I'd like to add support for Fuzzy queries, "coloring" (e.g. a different color for terms that fall between phrase terms within the slop), and scoring of fragments.

It's Apache licensed, I hope :-) I put the license statement in every file.


--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


[jira] Commented: (LUCENE-663) New feature rich higlighter for Lucene.

JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/LUCENE-663?page=comments#action_12429816 ]
           
Mark Harwood commented on LUCENE-663:
-------------------------------------

Hi Karel.
Many thanks for taking the time to make a contribution.

I would personally find it useful if you could describe your highlighter in terms of how it differs from existing implementations (the existing one in "contrib" and Ronnie Kolehmainen's recent contribution here: http://issues.apache.org/jira/browse/LUCENE-644?page=all ) . This would help us understand whether to consider this as an improvement to the existing approach or an alternative with different functionality.

I know for example that the existing contrib highlighter has all 3 of the functions you list as features (TermPositionVector/Analyzer support and support for all Lucene queries).

The sorts of improvement I can think of would be if your solution was
a) faster
b) lighter on memory
c) able to highlight span/phrase matches correctly
d) simpler to use

So can you clarify what your motivations were and where you see the main differences/improvements over existing code?

Thanks again,
Mark


[jira] Commented: (LUCENE-663) New feature rich higlighter for Lucene.

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/LUCENE-663?page=comments#action_12429848 ]
           
Karel Tejnora commented on LUCENE-663:
--------------------------------------

Hi,
yes, as I wrote in the code (which keeps the author credit), I borrowed small parts of the code from this contribution: http://issues.apache.org/jira/browse/LUCENE-644?page=all
There is a small bug in it when a term is at or near the end of the field. The fix is to change these lines to:
321: sb.append(cbuf, 0, EOF ? skip : (surround - skippedChars));
276: int readed = reader.read(cbuf, 0, nextStart - pos);
278: sb.append(cbuf, 0, readed);
I also borrowed from WildcardTermEnum.
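
The fixes to lines 276 and 278 rest on a general rule of java.io.Reader: read(char[], int, int) returns the number of characters actually read, which can be fewer than requested, or -1 at end of stream. Below is a minimal standalone sketch of that rule. It is not the LUCENE-644 code; the class and method names are made up for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class ReadCountSketch {
    /** Reads up to `wanted` chars and appends only the chars that were really read. */
    static String readUpTo(Reader reader, int wanted) {
        char[] cbuf = new char[wanted];
        StringBuilder sb = new StringBuilder();
        int total = 0;
        try {
            while (total < wanted) {
                // May return fewer chars than asked for, or -1 at end of stream.
                int readed = reader.read(cbuf, 0, wanted - total);
                if (readed == -1) {
                    break;
                }
                // Append exactly `readed` chars, never the whole buffer.
                sb.append(cbuf, 0, readed);
                total += readed;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The text is shorter than the requested length, as with a term near the end of a field.
        System.out.println(readUpTo(new StringReader("beer"), 10));
    }
}
```

Appending the requested length instead of the returned count would copy stale or empty buffer contents when the term sits close to the end of the field.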

Motivation: I was unable to find a highlighter with good performance and proper phrase highlighting (at the beginning I needed just phrases with slop 0).

For the query "karel drinks beer"~4 on the text "karel drinks a lot of beers. Beer is his life.", this highlighter produces: <PREFIX>karel</SUFFIX> <PREFIX>drinks</SUFFIX> a lot of czech <PREFIX>beer</SUFFIX>. Beer is his life.
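
To illustrate what "proper phrase highlighting" means here, the following is a rough standalone sketch, not the attached highlighter. It assumes a simplified slop rule (phrase terms must appear in order, with at most `slop` extra tokens inside the matched window), which is stricter than Lucene's real sloppy matching, and it uses a naive word tokenizer instead of an Analyzer. The class name, the method names and the sample input are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SloppyPhraseSketch {
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}]+");

    /** Wraps only the tokens that take part in an in-order phrase match within `slop`. */
    static String highlight(String text, List<String> phrase, int slop,
                            String prefix, String suffix) {
        if (phrase.isEmpty()) {
            return text;
        }
        // Naive tokenization: lower-cased term plus its character offsets.
        List<String> terms = new ArrayList<>();
        List<int[]> offsets = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            terms.add(m.group().toLowerCase(Locale.ROOT));
            offsets.add(new int[] {m.start(), m.end()});
        }
        boolean[] hit = new boolean[terms.size()];
        for (int start = 0; start < terms.size(); start++) {
            int[] positions = matchFrom(terms, phrase, start);
            if (positions == null) {
                continue;
            }
            // Tokens inside the window that are not phrase terms.
            int extra = positions[positions.length - 1] - positions[0] - (phrase.size() - 1);
            if (extra <= slop) {
                for (int p : positions) {
                    hit[p] = true;
                }
            }
        }
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (int i = 0; i < hit.length; i++) {
            if (!hit[i]) {
                continue;
            }
            int[] o = offsets.get(i);
            out.append(text, last, o[0]).append(prefix)
               .append(text, o[0], o[1]).append(suffix);
            last = o[1];
        }
        return out.append(text, last, text.length()).toString();
    }

    /** Greedy in-order match starting at token `start`; null if the phrase does not match there. */
    private static int[] matchFrom(List<String> terms, List<String> phrase, int start) {
        int[] positions = new int[phrase.size()];
        int pos = start;
        for (int i = 0; i < phrase.size(); i++) {
            String wanted = phrase.get(i).toLowerCase(Locale.ROOT);
            if (i > 0) {
                pos++;
            }
            if (i == 0) {
                if (!terms.get(pos).equals(wanted)) {
                    return null;
                }
            } else {
                while (pos < terms.size() && !terms.get(pos).equals(wanted)) {
                    pos++;
                }
                if (pos == terms.size()) {
                    return null;
                }
            }
            positions[i] = pos;
        }
        return positions;
    }

    public static void main(String[] args) {
        String text = "karel drinks a lot of czech beer. Beer is his life.";
        System.out.println(highlight(text, Arrays.asList("karel", "drinks", "beer"), 4, "<b>", "</b>"));
    }
}
```

With slop 4 the three phrase terms are wrapped and the second "Beer" is left alone, because it is not part of any phrase match. A plain term highlighter would mark every occurrence.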

I started by implementing a stack for phrase queries and ended up with this. It is still not final: fuzzy, span, scoring and coloring need to be done.
By 'coloring' I mean:
<PREFIX>karel</SUFFIX> <PREFIX>drinks</SUFFIX> a <PREFIX1>lot</SUFFIX1> of <PREFIX1>czech</SUFFIX1> <PREFIX>beer</SUFFIX>. Beer is his life.

For the wildcard BMW* -> <PREFIX>BMW</SUFFIX><PREFIX1>ED</SUFFIX1>
etc.

So the user can see why a document matches the query.
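
The wildcard case of coloring can be sketched as follows. This is only an illustration of the idea, not code from the attachment: it assumes a pattern with a single trailing '*', and the class name, method name and marker parameters are hypothetical.

```java
final class WildcardColoringSketch {
    /**
     * Wraps the literal part of a trailing-star pattern in one marker pair
     * and the expanded remainder of the token in a second marker pair.
     */
    static String color(String token, String pattern,
                        String open, String close, String open1, String close1) {
        String literal = pattern.endsWith("*")
                ? pattern.substring(0, pattern.length() - 1)
                : pattern;
        // Case-insensitive check that the token starts with the literal part.
        if (!token.regionMatches(true, 0, literal, 0, literal.length())) {
            return token;
        }
        String head = token.substring(0, literal.length());
        String tail = token.substring(literal.length());
        StringBuilder sb = new StringBuilder();
        if (!head.isEmpty()) {
            sb.append(open).append(head).append(close);
        }
        if (!tail.isEmpty()) {
            sb.append(open1).append(tail).append(close1);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(color("BMWED", "BMW*",
                "<PREFIX>", "</SUFFIX>", "<PREFIX1>", "</SUFFIX1>"));
    }
}
```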

Usage is maybe more straightforward:

Construct a highlighter where all passed fields are highlighted using TermPositionVector (null is returned where there is no TPV):

FulltextHighlighter highlighter = new FulltextHighlighter(reader, query, prefix, suffix);

OR
Construct a highlighter where all highlighted fields are highlighted using an Analyzer:

FulltextHighlighter highlighter = new FulltextHighlighter(analyzer, query, prefix, suffix);

OR
Construct a highlighter where the Analyzer or TermVector is autodetected:

FulltextHighlighter highlighter = new FulltextHighlighter(reader, analyzer, query, prefix, suffix);

And when iterating hits:
String highlightedText = highlighter.highlight(luceneDocumentID, luceneDocument, fieldName);  // uses the TPV

OR
String highlightedText = highlighter.highlight(luceneDocument, fieldName);  // uses the Analyzer; if TPV usage is forced, an assert fires

It has some options:
setAnalyzerUnstable(boolean analyzerUnstable): set it to false (default is true) if you know that t(n).startOffset() < t(n+1).startOffset() for consecutive tokens
setMaxFragments(int i): maximum number of fragments
setSurround(int surround);
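
The condition behind setAnalyzerUnstable can be checked on a sample of token start offsets. The sketch below is a hypothetical helper, not part of the attachment; it works on a plain int array so that it stays independent of any Lucene version.

```java
final class OffsetOrderSketch {
    /** Returns true if every token starts strictly after the previous one. */
    static boolean startOffsetsStrictlyIncrease(int[] startOffsets) {
        for (int n = 0; n + 1 < startOffsets.length; n++) {
            if (startOffsets[n] >= startOffsets[n + 1]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Three tokens at distinct, increasing offsets.
        System.out.println(startOffsetsStrictlyIncrease(new int[] {0, 6, 13}));
        // Two tokens sharing a start offset break the condition.
        System.out.println(startOffsetsStrictlyIncrease(new int[] {0, 6, 6}));
    }
}
```

If the check holds for the analyzer in use, passing false to setAnalyzerUnstable should be safe according to the description above.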

a) b) I don't know; maybe it will be faster or lighter, or neither, but I started because none of the contributed or submitted highlighters gave 'nice' results.
I use a lot of queries that search for names, like "James Bond" OR "Sean Connery", and this gives me a nicer view of why the document matches my query.

:-) Or I don't know how to use Google


[jira] Commented: (LUCENE-663) New feature rich higlighter for Lucene.

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/LUCENE-663?page=comments#action_12430480 ]
           
Ronnie Kolehmainen commented on LUCENE-663:
-------------------------------------------

Karel,

although the tests passed, your three line fixes do indeed look valid, so the files in LUCENE-644 have been updated accordingly. Thanks.




[jira] Commented: (LUCENE-663) New feature rich higlighter for Lucene.

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/LUCENE-663?page=comments#action_12442203 ]
           
Karel Tejnora commented on LUCENE-663:
--------------------------------------


   [[ Old comment, sent by email on Wed, 23 Aug 2006 02:21:04 +0200 ]]

It is too late here... In my earlier comment,

on text karel drinks a lot of beers. Beer is his life.

should be:

on text karel drinks a lot of czech beers. Beer is his life.


(both are true... but the second is better in all cases)

Well, it has support for fuzzy terms, provided the fuzzy query returns its terms from
query.extractTerms



[jira] Commented: (LUCENE-663) New feature rich higlighter for Lucene.

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/LUCENE-663?page=comments#action_12447678 ]
           
[hidden email] commented on LUCENE-663:
-------------------------------------------------------


   [[ Old comment, sent from unregistered email on Mon, 23 Oct 2006 04:09:23 -0700 ]]

Hello Karel,

I want to use the new highlighter API to match only whole phrases and not the individual tokens in the phrase. Can you please tell me where I can find the source for the new highlighter API?

Thanks,
Harini

Quoted from:
http://www.nabble.com/-jira--Created%3A-%28LUCENE-663%29-New-feature-rich-higlighter-for-Lucene.-tf2147495.html#a5929597


