highlighting

highlighting

Erik Hatcher
I would like to have highlighting of selected field(s) in Solr search  
results.  Certainly a custom request handler can do this, but I'm  
curious if the standard handler and configuration should evolve to  
handle the common need for search term highlighting, and if so how  
would that ideally look in the configuration and search request?

I am game for developing the highlighting piece in some way in the  
next few days, and would gladly contribute that feature back provided  
it was done in a way that fits with Solr's architecture.

Thanks,
        Erik


Re: highlighting

Yonik Seeley
On 4/4/06, Erik Hatcher <[hidden email]> wrote:
> I would like to have highlighting of selected field(s) in Solr search
> results.  Certainly a custom request handler can do this, but I'm
> curious if the standard handler and configuration should evolve to
> handle the common need for search term highlighting,

Absolutely!

> and if so how would that ideally look in the configuration and search request?

Great question... and how would it look in the search results as well.
I haven't used highlighting yet in Lucene, so I'm not sure what the
best way to fit it into Solr would be.

I guess it's time to go read that part in LIA :-)

One thing right off the bat: I think highlighting probably needs the
stored fields...
To support streaming of large result sets, I don't retrieve all the
documents up front - it's actually done in the XML serializer.  That
may make things slightly more difficult.

It's probably best to focus on the ideal interface first (query
parameters as input format, and desired XML output format).

For the XML output format, we need to decide if the highlight info goes
in or after each <field>, in or after each <doc>, or in a separate
section altogether.  Also need to consider multivalued fields.

The current format for fields looks like this for single-valued fields:
  <field name="title">How now brown cow</field>
And this for multi-valued fields:
  <arr name="title"><str>This is the first title</str> <str>This is the second</str></arr>

-Yonik

Re: highlighting

Yonik Seeley
> It's probably best to focus on the ideal interface first (query
> parameters as input format, and desired XML output format).

We might also want to keep termvectors in mind when thinking about
this stuff... seems like they are related (per-field optional/extra
data).

-Yonik

Re: highlighting

Chris Hostetter-3
In reply to this post by Erik Hatcher

For the record, i know next to nothing about highlighting in Lucene.  i
can't remember if i read that chapter in LIA or not :)

: curious if the standard handler and configuration should evolve to
: handle the common need for search term highlighting, and if so how

+1

: would that ideally look in the configuration and search request?

one of the things i've been doing in my custom plugins (one of which is
really generic and i'm hoping to get permission to commit it back to solr
real soon now) is to make every possible query param have a corresponding
identically named init param (in the solr config) which it uses as the
default.  That way you can have...
    <str name="highlightFields">title description</str>
...in your solrconfig.xml, and clients that want different behavior can
override it with...
   highlightFields=title+description+body
...in the URL.
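
A rough sketch of that fallback rule, in Python rather than Solr's Java for brevity. The names `resolve_param` and `highlight_fields` and the plain dictionaries are hypothetical illustrations, not part of any Solr API:

```python
def resolve_param(name, request_params, init_params, fallback=None):
    """Return the request value for `name`, else the configured default."""
    if name in request_params:
        return request_params[name]
    return init_params.get(name, fallback)


def highlight_fields(request_params, init_params):
    """Split the whitespace-separated field list into field names."""
    raw = resolve_param("highlightFields", request_params, init_params, "")
    return raw.split()


# Defaults as they might be read from solrconfig.xml (hypothetical values).
init_params = {"highlightFields": "title description"}

# A request that does not mention the parameter gets the configured default.
default_fields = highlight_fields({}, init_params)

# "title+description+body" in the URL decodes to a space-separated string.
override_fields = highlight_fields(
    {"highlightFields": "title description body"}, init_params
)
```

The request value wins whenever it is present, so a handler never needs separate code paths for "configured" and "requested" behavior.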

: I am game for developing the highlighting piece in some way in the
: next few days, and would gladly contribute that feature back provided
: it was done in a way that fits with Solr's architecture.

from a usage standpoint, i think adding both a URL param and init param
to StandardRequestHandler that takes in a space separated list of
fieldNames to highlight makes a lot of sense ... the question is what do
we do with it?

Modifying XMLWriter and SolrQueryResponse to have "defaultHighlightFields"
in the same way they currently have "defaultReturnFields" seems like it
makes the most sense (especially since that way other plugins can use it
as well).  Then the XMLWriter can include a new <hi>word</hi> in its
output anytime it wants to highlight something.

(NOTE: Adding XML markup for highlighting probably means the default
"Protocol Version" should be rev'ed to 2.2, and highlighting should be
flat out disabled if the version is less than that so older clients
aren't suddenly surprised to find xml markup in their strings if the server
configuration changes)
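
A minimal sketch of that version gate, again in Python for brevity. The function names and the idea of passing the version as a dotted string are assumptions for illustration only:

```python
# Assumption: 2.2 is the first protocol version allowed to carry highlight markup.
HIGHLIGHT_MIN_VERSION = (2, 2)


def parse_version(text):
    """Turn a dotted version string such as "2.2" into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def highlighting_enabled(client_version, requested_fields):
    """Highlight only if fields were requested and the client is new enough."""
    if not requested_fields:
        return False
    return parse_version(client_version) >= HIGHLIGHT_MIN_VERSION
```

Comparing tuples rather than floats keeps a hypothetical version "2.10" ordered after "2.2".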


-Hoss


Re: highlighting

Erik Hatcher
I managed to hack some highlighting into a request handler last night  
for a quick and dirty application demo, but it is less than ideal.    
The current situation with XMLWriter actually pulling the Document  
from the index coupled with the lack of access to the Query causes  
this to currently be a tricky situation.  My hack is just within the  
handleRequest method of the request handler and makes a second pass  
over the DocList and re-retrieves the Document objects to highlight  
them, and adds the highlighted text to additional XML elements in the  
response, not to the <doc>'s.  So my current hack is not worth  
contributing.

Yonik additionally brought up some other very good points regarding  
term vectors and stored fields.  Stored fields would be necessary for  
highlighting in the general sense, certainly, but I envision some  
applications wanting to store the original text elsewhere and a  
custom highlighting hook used to retrieve the original text through  
other means.

I'm not quite sure where to go with this highlighting issue from here  
given what seems to be a bit of an overhaul in where the Document  
objects are accessed, or in being able to get the full context of the  
Query (and filters, etc) down to the XMLWriter.

Thoughts?

        Erik





Re: highlighting

Chris Hostetter-3

: this to currently be a tricky situation.  My hack is just within the
: handleRequest method of the request handler and makes a second pass
: over the DocList and re-retrieves the Document objects to highlight
: them, and adds the highlighted text to additional XML elements in the
: response, not to the <doc>'s.  So my current hack is not worth
: contributing.

I disagree ... i think that's actually a pretty decent approach.

After the first burst of discussion on this thread, i remember thinking
that it would not only be hard to modify the XMLWriter, but also confusing
to know how to deal with it in the client -- the simplicity of the current
response in which a <str> is just a string would be broken -- now a <str>
might have nested highlighting information.

I also remember thinking that if highlighting was done "inline" then the
onus of finding good "snippets" would be left to the client.

I could have sworn i sent out another followup message about this, but i
can't find it now -- must have been one of those emails i composed in my
head while i was falling asleep and then forgot about.

I think having a separate data payload containing highlighted snippets
(which may or may not be whole stored fields) really may be the best
approach ... the question i was struggling over was what format should
that data take: something new not currently possible, or something that
fits into the existing tag structure?

Something that might work is to add a list per doc in the DocList,
containing NamedLists where the names are fields the client wants
highlighted, each containing a list of "snippets" where each snippet is
a NamedList where the values are chunks of text in order, and the chunk
has a name if it should be highlighted

ie...

  <lst name="highlighting">
    <lst>  <!-- first doc in doclist -->
      <lst name="title"> <!-- first field w/highlighting -->
        <!-- may be multiple snippets per field -->
        <lst> <!-- first snippet -->
          <str>now is the time for </str>
          <str name="highlight">all good</str>
          <str> men to come to the</str>
        </lst>
        <lst><!-- second snippet -->
          ...
        </lst>
      </lst>
      <lst name="body"> <!-- second field w/highlighting -->
       ...
      </lst>
    </lst>
    <lst>  <!-- second doc in doclist -->
      ...
    </lst>
    ...
  </lst>

...it seems a little verbose, but it allows for arbitrary
highlighting of arbitrary sized snippets, doesn't introduce any new
complexity to the XML Format (or XML Writing) and could be implemented
completely independently of the Documents themselves (so plugins could
fetch the text to be highlighted from external data)
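
To get a feel for how a client might consume a payload shaped like the one proposed above, here is a small Python sketch using the standard library's `xml.etree.ElementTree`. The sample document, the function names, and the `<em>` markers are assumptions for illustration; nothing here is an existing Solr response format:

```python
import xml.etree.ElementTree as ET

# A cut-down instance of the proposed structure: one doc, one field, one snippet.
SAMPLE = """
<lst name="highlighting">
  <lst>
    <lst name="title">
      <lst>
        <str>now is the time for </str>
        <str name="highlight">all good</str>
        <str> men to come to the</str>
      </lst>
    </lst>
  </lst>
</lst>
"""


def render_snippet(snippet, open_tag="<em>", close_tag="</em>"):
    """Join a snippet's chunks in order, wrapping the named ones."""
    parts = []
    for chunk in snippet.findall("str"):
        text = chunk.text or ""
        if chunk.get("name") == "highlight":
            text = open_tag + text + close_tag
        parts.append(text)
    return "".join(parts)


def parse_highlighting(xml_text):
    """Return one dict per document: field name -> list of rendered snippets."""
    root = ET.fromstring(xml_text)
    docs = []
    for doc in root.findall("lst"):
        fields = {}
        for field in doc.findall("lst"):
            fields[field.get("name")] = [
                render_snippet(s) for s in field.findall("lst")
            ]
        docs.append(fields)
    return docs


result = parse_highlighting(SAMPLE)
```

Because every chunk is an ordinary `<str>`, the client decides how to mark up matches and needs no parsing rules beyond what it already uses for `<lst>` and `<str>`.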




-Hoss


Re: highlighting

Yonik Seeley
In reply to this post by Erik Hatcher
On 4/17/06, Erik Hatcher <[hidden email]> wrote:
> The current situation with XMLWriter actually pulling the Document
> from the index

Yeah, but seeing people ask for *all* matching documents (or sometimes
even all documents in the index) makes me think that we need to keep
streamability.

> coupled with the lack of access to the Query causes
> this to currently be a tricky situation.
> My hack is just within the
> handleRequest method of the request handler and makes a second pass
> over the DocList and re-retrieves the Document objects to highlight
> them,


There are a number of ways this could be handled, I think.

1) Preventing documents from being retrieved more than once:
  a) may not be a big deal with the document cache enabled, since they
should still be there
  b) could create a subclass of DocList or another class that contains
Document objects, not just the ids.  XMLWriter would need to be
changed to handle this type of class.

2) Access to the query for highlighting:
  a) I don't think streamability of results is important for
highlighting (I assume no one will ask for a million documents and
have them all highlighted), so it could be done ahead of time for all
the documents.
  b) More context (or even user-specified context) could be added to
the SolrRequest, and the Query(s) could go there.
  c) If we had a custom DocList object from 1.b then it could also
have a custom one for highlighting that carried this extra info.
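
Option 2a (highlighting everything ahead of time in a second pass over the doc list) could look roughly like the Python sketch below. The regex matcher is a naive stand-in for a real highlighter such as Lucene's, and `highlight_chunks`, `highlight_doc_list`, and the in-memory `STORE` are hypothetical names, not Solr classes:

```python
import re


def highlight_chunks(text, terms):
    """Split text into (chunk, is_match) pairs, in order."""
    if not terms:
        return [(text, False)]
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    chunks, pos = [], 0
    for match in pattern.finditer(text):
        if match.start() > pos:
            chunks.append((text[pos:match.start()], False))
        chunks.append((match.group(0), True))
        pos = match.end()
    if pos < len(text):
        chunks.append((text[pos:], False))
    return chunks


def highlight_doc_list(doc_ids, fetch_document, terms, fields):
    """Second pass: fetch each document and highlight the chosen fields."""
    highlighting = []
    for doc_id in doc_ids:
        doc = fetch_document(doc_id)
        per_field = {}
        for name in fields:
            if name in doc:
                per_field[name] = highlight_chunks(doc[name], terms)
        highlighting.append(per_field)
    return highlighting


# Stand-in for stored fields; a document cache would play this role in 1a.
STORE = {
    1: {"title": "How now brown cow"},
    2: {"title": "This is the first title"},
}
result = highlight_doc_list([1, 2], STORE.__getitem__, ["brown"], ["title"])
```

Since `fetch_document` is just a callable, the same pass would work whether the text comes from stored fields or from somewhere outside the index.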

> and adds the highlighted text to additional XML elements in the
> response, not to the <doc>'s.  So my current hack is not worth
> contributing.

I'm not even sure what the ideal highlighter syntax would look like...
Do you have an example of what you would consider ideal?
Highlighting seems important and universal enough that I wouldn't be
opposed to adding special syntax for it if it's really needed.  We
would want to make it flexible/powerful enough to handle whatever Mark
Harwood is cooking up for future highlighting as well.

> Yonik additionally brought up some other very good points regarding
> term vectors and stored fields.  Stored fields would be necessary for
> highlighting in the general sense, certainly, but I envision some
> applications wanting to store the original text elsewhere and a
> custom highlighting hook used to retrieve the original text through
> other means.

Hmmm, some sort of callback interface for XMLWriter for classes it
doesn't know about?

> I'm not quite sure where to go with this highlighting issue from here
> given what seems to be a bit of an overhaul in where the Document
> objects are accessed, or in being able to get the full context of the
> Query (and filters, etc) down to the XMLWriter.

Ahh, just details... nothing that can't be fixed.

> Thoughts?

Focus on the interface:
 - how clients will specify what extra info they want
 - how clients typically parse and use the XML (extra bonus if we can
make it semi-friendly to stylesheets/XSLT), and the ideal syntax for
representing the extra info

Then it's just a small matter of implementing it :-)

-Yonik

Re: highlighting

Chris Hostetter-3

: Focus on the interface:
:  - how clients will specify what extra info they want
:  - how clients typically parse and use the XML (extra bonus if we can
: make it semi-friendly to stylesheets/XSLT), and the ideal syntax for
: representing the extra info

To add to that: when thinking about "how clients will specify what extra
info they want" we should consider not only external clients using HTTP
and the StandardRequestHandler, but also what the internal API looks like
for people wanting to add highlighting to their own plugin.




-Hoss


Re: highlighting

Erik Hatcher
Hoss, I've seen you mention "plugin" several times... I presume you  
mean a custom request handler.  If not, could you elaborate on what  
you mean?

Thanks,
        Erik




Re: highlighting

Yonik Seeley
In reply to this post by Chris Hostetter-3
On 4/18/06, Chris Hostetter <[hidden email]> wrote:
> To add to that: when thinking about "how clients will specify what extra
> info they want" we should consider not only external clients using HTTP
> and the StandardRequestHandler, but also what the internal API looks like
> for people wanting to add highlighting to their own plugin.

And at a lower priority, other formats than XML.
I've considered adding a JSON response format that's smaller and very
AJAX friendly.

-Yonik

Re: highlighting

Erik Hatcher

On Apr 18, 2006, at 1:22 PM, Yonik Seeley wrote:

> On 4/18/06, Chris Hostetter <[hidden email]> wrote:
>> To add to that: when thinking about "how clients will specify what extra
>> info they want" we should consider not only external clients using HTTP
>> and the StandardRequestHandler, but also what the internal API looks like
>> for people wanting to add highlighting to their own plugin.
>
> And at a lower priority, other formats than XML.
> I've considered adding a JSON response format that's smaller and very
> AJAX friendly.

Yes indeed.  I've been thinking of ways to increase the performance  
to a Ruby (on Rails) front-end as well.  Serializing the response as  
plain Ruby that gets eval'd on the client side would very likely be  
the most performant way to do it, or perhaps as YAML instead of XML.

Callbacks from the serializer to get documents and their fields, and  
also highlighting and such, would be a nice way to go about things it  
seems... decoupling without losing the streaming capability.
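
A very rough Python sketch of the callback idea, where the serializer asks for each document (and optionally its highlights) only when it is about to write it. All names here are hypothetical and the dict records stand in for whatever output format is actually written:

```python
def stream_response(doc_ids, get_document, get_highlights=None):
    """Yield one record per document, fetching each one lazily."""
    for doc_id in doc_ids:
        record = {"doc": get_document(doc_id)}
        if get_highlights is not None:
            record["highlighting"] = get_highlights(doc_id)
        yield record


fetched = []  # records which documents were actually retrieved


def get_document(doc_id):
    fetched.append(doc_id)
    return {"id": doc_id, "title": "doc %d" % doc_id}


def get_highlights(doc_id):
    return {"title": ["<em>doc</em> %d" % doc_id]}


# A large result set: nothing is fetched until the writer asks for it.
stream = stream_response(range(1, 1001), get_document, get_highlights)
first = next(stream)
```

Only the first document has been retrieved after `next(stream)`, so the writer stays streamable while knowing nothing about where documents or highlights come from.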

        Erik


Re: highlighting

Chris Hostetter-3
In reply to this post by Erik Hatcher

: Hoss, I've seen you mention "plugin" several times... I presume you
: mean a custom request handler.  If not, could you elaborate on what
: you mean?

sorry, yes ... most generally I mean any code which doesn't ship with
Solr, but which is loaded into the JVM at Solr's request because of
configuration specified by the user.

In the context of highlighting I'm referring to custom SolrRequestHandlers,
but in a broader context I think of custom Analyzers, TokenizerFactories,
TokenFilterFactories, Similarities, SolrCaches, CacheRegenerators, and
UpdateHandlers as types of "plugins".


-Hoss