Better highlighting fragmenter

Mike Klaas
I've written an unpolished custom fragmenter for highlighting which is
more expensive than the BasicFragmenter that ships with lucene, but
generates more natural candidate fragments (it will tend to produce
beginning/ends of sentences).
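
For readers unfamiliar with the idea, here is a minimal sketch of what "fragments that begin and end on sentence boundaries" can look like. It is not the fragmenter discussed in this thread and does not use the Lucene highlighter API; the class name SentenceFragments and the method split are hypothetical, and sentence detection is delegated to the JDK's java.text.BreakIterator.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only; not the fragmenter from this thread.
class SentenceFragments {
    // Splits text into candidate fragments that start and end on sentence boundaries.
    static List<String> split(String text) {
        List<String> fragments = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        int start = it.first();
        // Walk consecutive boundaries; each [start, end) range is one sentence.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                fragments.add(sentence);
            }
        }
        return fragments;
    }
}
```

A real fragmenter would also have to cap fragment length and cope with text that has no sentence punctuation, which is presumably part of the extra cost mentioned above.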

Would there be interest in the community in releasing it and/or
including it in Solr?

-Mike

Re: Better highlighting fragmenter

Michael Imbeault
I for one would be interested in such a fragmenter, as the default one
is lacking and doesn't produce acceptable results for most applications.

Michael

Mike Klaas wrote:

> I've written an unpolished custom fragmenter for highlighting which is
> more expensive than the BasicFragmenter that ships with lucene, but
> generates more natural candidate fragments (it will tend to produce
> beginning/ends of sentences).
>
> Would there be interest in the community in releasing it and/or
> including it in Solr?
>
> -Mike
>                            


Re: Better highlighting fragmenter

Erik Hatcher
In reply to this post by Mike Klaas

On Jan 3, 2007, at 6:36 PM, Mike Klaas wrote:

> I've written an unpolished custom fragmenter for highlighting which is
> more expensive than the BasicFragmenter that ships with lucene, but
> generates more natural candidate fragments (it will tend to produce
> beginning/ends of sentences).
>
> Would there be interest in the community in releasing it and/or
> including it in Solr?

No, we want to stay with the more unnatural fragmenter, thank you very
much.  :)  Just kidding, of course we'd love to have better
highlighting!  Your fragmenter could be added to the
contrib/highlighter area, and made configurable in Solr, I'm sure.

        Erik


Re: Better highlighting fragmenter

Chris Hostetter-3
In reply to this post by Mike Klaas

: I've written an unpolished custom fragmenter for highlighting which is
: more expensive than the BasicFragmenter that ships with lucene, but
: generates more natural candidate fragments (it will tend to produce
: beginning/ends of sentences).
:
: Would there be interest in the community in releasing it and/or
: including it in Solr?

Mike: I don't really follow the highlighting/fragmenting buzz, but it
seems like it might make sense to contribute this directly to Lucene-Java
... of course, if you want to go ahead and commit it to Solr, it can
always be "promoted" up to Lucene-Java later (like I suspect
FunctionQuery will be just as soon as someone gets an itch to move it)



-Hoss


Re: Better highlighting fragmenter

Mike Klaas
On 1/3/07, Chris Hostetter <[hidden email]> wrote:

>
> : I've written an unpolished custom fragmenter for highlighting which is
> : more expensive than the BasicFragmenter that ships with lucene, but
> : generates more natural candidate fragments (it will tend to produce
> : beginning/ends of sentences).
> :
> : Would there be interest in the community in releasing it and/or
> : including it in Solr?
>
> Mike: I don't really follow the highlighting/fragmenting buzz, but it
> seems like it might make sense to contribute this directly to Lucene-Java
> ... of course, if you want to go ahead and commit it to Solr, it can
> always be "promoted" up to Lucene-Java later (like I suspect
> FunctionQuery will be just as soon as someone gets an itch to move it)

Yeah, I thought about that.  There are a few reasons I wouldn't want to
contribute it there immediately:

  - ease of maintenance
  - Highlighting is a contrib module in lucene, and there are various
aspects of it that I don't really like.  I see it more as a means of
implementing Solr's highlighting.  What I'd like to do is improve the
end-user's experience with highlighting in Solr.  If as a result a
high-quality component for lucene Highlighter is fleshed out, that can
always be contributed to Lucene later.

Generally, we should strive for a high-quality out-of-the-box
highlighting in Solr.  That might involve making things like better
fragmenters and a few other tricks(*) the default setup, and providing
a "quick & dirty" setting for speed demons.

(*) Doing some basic cleaning of the generated fragments works wonders.

-Mike

Re: Better highlighting fragmenter

Chris Hostetter-3

: implementing Solr's highlighting.  What I'd like to do is improve the
: end-user's experience with highlighting in Solr.  If as a result a
: high-quality component for lucene Highlighter is fleshed out, that can
: always be contributed to Lucene later.
:
: Generally, we should strive for a high-quality out-of-the-box
: highlighting in Solr.  That might involve making things like better
: fragmenters and a few other tricks(*) the default setup, and providing
: a "quick & dirty" setting for speed demons.

sounds good to me ... go with your gut.


-Hoss


Re: Better highlighting fragmenter

Walter Underwood, Netflix
In reply to this post by Mike Klaas
On 1/3/07 5:13 PM, "Mike Klaas" <[hidden email]> wrote:

> Generally, we should strive for a high-quality out-of-the-box
> highlighting in Solr.  That might involve making things like better
> fragmenters and a few other tricks(*) the default setup, and providing
> a "quick & dirty" setting for speed demons.

I've implemented this before, once in Python and once in C, so I'd
be glad to take a look at it. I'm not sure I have time to do a lot
of implementation, but I'd sure be glad to help.

We tried several APIs and decided that the best was an array of
String with the odd elements containing the strings that needed
highlighting. That made it really easy to step through and wrap
highlighted stuff with the right markup, while properly escaping
any angle brackets in the source text.

I'm not sure how easy it is to handle that format in XSLT, but
it might be worth it. Embedded highlight markup just doesn't work.

wunder
--
Walter Underwood
Search Guru, Netflix



Re: Better highlighting fragmenter

Mike Klaas
On 1/3/07, Walter Underwood <[hidden email]> wrote:

> I've implemented this before, once in Python and once in C, so I'd
> be glad to take a look at it. I'm not sure I have time to do a lot
> of implementation, but I'd sure be glad to help.

Cool.  I'll post the current fragmenter as a JIRA issue soon.

> We tried several APIs and decided that the best was an array of
> String with the odd elements containing the strings that needed
> highlighting. That made it really easy to step through and wrap
> highlighted stuff with the right markup, while properly escaping
> any angle brackets in the source text.

That is _much_ better than the current system.  It wouldn't be hard to
add start/end offsets to the fragments too, as Chris suggested so long
ago.

> I'm not sure how easy it is to handle that format in XSLT, but
> it might be worth it. Embedded highlight markup just doesn't work.

Quite a few aspects of the result format appear to be problematic from
a direct-XSLT-consumption perspective... which probably isn't too
surprising as I don't think that was really the original intent.

Maybe it would be a good idea to work toward a simplified XML response
writer designed for XSLT consumption?

-Mike

Re: Better highlighting fragmenter

Yonik Seeley-2
In reply to this post by Walter Underwood, Netflix
On 1/3/07, Walter Underwood <[hidden email]> wrote:
> We tried several APIs and decided that the best was an array of
> String with the odd elements containing the strings that needed
> highlighting.

Good idea... the only thing I could think of was an array of start/end
offsets into the string, which is harder to read and probably harder
to deal with.
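
The two representations are easy to convert between. As a rough sketch (the class name Interleave and the method fromOffsets are hypothetical, and the spans are assumed to be sorted, non-overlapping {start, end} pairs with an exclusive end), start/end offsets can be turned into the even/odd list like this:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Solr or Lucene.
class Interleave {
    // spans: sorted, non-overlapping {start, end} pairs (end exclusive) into text.
    // Returns a list where even indexes hold plain text and odd indexes hold matches.
    static List<String> fromOffsets(String text, int[][] spans) {
        List<String> parts = new ArrayList<>();
        int pos = 0;
        for (int[] span : spans) {
            parts.add(text.substring(pos, span[0]));     // plain text, possibly empty
            parts.add(text.substring(span[0], span[1])); // matched text
            pos = span[1];
        }
        parts.add(text.substring(pos)); // trailing plain text, possibly empty
        return parts;
    }
}
```

Keeping the empty strings means the odd positions always hold the matches, even when a match sits at the very start or end of the text.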

-Yonik

Re: Better highlighting fragmenter

Yonik Seeley-2
In reply to this post by Mike Klaas
On 1/3/07, Mike Klaas <[hidden email]> wrote:
> That is _much_ better than the current system.  It wouldn't be hard to
> add start/end offsets to the fragments too, as Chris suggested so long
> ago.

Or leave room for other info such as weights, or what term matched, etc.

> Quite a few aspects of the result format appear to be problematic from
> a direct-XSLT-consumption perspective... which probably isn't too
> surprising as I don't think that was really the original intent.

Yeah, I didn't envision direct XSLT processing, as it's
certainly not a CNET use case, and there are security issues with
directly hitting a Solr index.

-Yonik

Re: Better highlighting fragmenter

Walter Underwood, Netflix
In reply to this post by Yonik Seeley-2
On 1/3/07 9:33 PM, "Yonik Seeley" <[hidden email]> wrote:

> On 1/3/07, Walter Underwood <[hidden email]> wrote:
>> We tried several APIs and decided that the best was an array of
>> String with the odd elements containing the strings that needed
>> highlighting.
>
> Good idea... the only thing I could think of was an array of start/end
> offsets into the string, which is harder to read and probably harder
> to deal with.

Yep. The client code for the even/odd List is really simple.
Something like this:

for (int i=0; i<list.size(); i++) {
    if (i%2 == 1) sb.append("<b>");
    sb.append(handyXmlQuotingMethod(list.get(i)));
    if (i%2 == 1) sb.append("</b>");
}
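
For anyone who wants to run it, here is a self-contained version of the same loop. The class name HighlightRenderer and the escapeXml helper are made up for this example; escapeXml is only a stand-in for handyXmlQuotingMethod and handles just &, < and >.

```java
import java.util.List;

// Hypothetical wrapper around the loop above, for illustration only.
class HighlightRenderer {
    // Stand-in for handyXmlQuotingMethod: escapes &, < and > (ampersand first).
    static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Even elements are plain text, odd elements are wrapped in <b>...</b>.
    static String render(List<String> parts) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < parts.size(); i++) {
            boolean highlighted = (i % 2 == 1);
            if (highlighted) sb.append("<b>");
            sb.append(escapeXml(parts.get(i)));
            if (highlighted) sb.append("</b>");
        }
        return sb.toString();
    }
}
```

Because the source text is escaped piece by piece before the markup is added, angle brackets in the document can never be confused with the highlight tags.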

wunder



Re: Better highlighting fragmenter

Mike Klaas
On 1/3/07, Walter Underwood <[hidden email]> wrote:

> On 1/3/07 9:33 PM, "Yonik Seeley" <[hidden email]> wrote:
>
> > On 1/3/07, Walter Underwood <[hidden email]> wrote:
> >> We tried several APIs and decided that the best was an array of
> >> String with the odd elements containing the strings that needed
> >> highlighting.
> >
> > Good idea... the only thing I could think of was an array of start/end
> > offsets into the string, which is harder to read and probably harder
> > to deal with.
>
> Yep. The client code for the even/odd List is really simple.
> Something like this:
>
> for (int i=0; i<list.size(); i++) {
>     if (i%2 == 1) sb.append("<b>");
>     sb.append(handyXmlQuotingMethod(list.get(i)));
>     if (i%2 == 1) sb.append("</b>");
> }

We wouldn't even have to complicate the API -> just trigger this
functionality based on hl.formatter.  hl.formatter=interleave?

-Mike