highlight search keywords on html page

7 messages

highlight search keywords on html page

nick19701
With Solr, I can generate a list of links containing highlighted fragments.
After a user clicks a link, I fetch the stored, non-indexed HTML from Solr and return it to the user.
But I want the search keywords within the HTML to be highlighted, just like Google does.
I'm wondering what people use to accomplish this very common task.

Re: highlight search keywords on html page

Chris Hostetter-3

I'm not sure I understand your question ... is it how to highlight a
stored field that has HTML in it, or how to index a chunk of HTML text?

The first should be no different than highlighting any other bit of text
-- the second can be accomplished using the
HTMLStripStandardTokenizerFactory (or
HTMLStripWhitespaceTokenizerFactory) in your schema.

: With solr, I can generate a list of links containing highlighted fragments.
: After a user clicks a link, I will fetch the stored and not-indexed html
: from solr and return it to user.
: But I want search keywords within the html to be highlighted just like
: google.
: I'm wondering what people are using to accomplish this very common task.



-Hoss


Re: highlight search keywords on html page

nick19701
Chris Hostetter wrote
I'm not sure I understand your question ... is it how to highlight a
stored field that has HTML in it, or how to index a chunk of HTML text?

The first should be no different than highlighting any other bit of text
-- the second can be accomplished using the
HTMLStripStandardTokenizerFactory (or
HTMLStripWhitespaceTokenizerFactory) in your schema.

-Hoss

It seems neither of the cases you described is what I want.
Please allow me to explain again:

I have two fields in my doc:
 <field name="html" type="string" indexed="false" stored="true" compressed="true"/>
 <field name="pageContent" type="text" indexed="true" stored="true" compressed="true"/>
 
In "html" I store the raw HTML fetched from the internet. It's not indexed, just stored as a string.
After removing the tags from "html", I get plain text and store it as "pageContent". This field
is indexed and stored.
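
For readers who want to see the tag-removal step concretely, here is a minimal sketch of deriving the "pageContent" text from the raw HTML. It is not necessarily how the poster does it: the names `TextExtractor` and `html_to_text` are made up for illustration, and it relies only on Python's standard `html.parser` module.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, ignoring tags and script/style content."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(raw_html):
    """Return the text of raw_html with tags removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(raw_html)
    parser.close()
    # Joining with a space keeps words from adjacent blocks apart, at the
    # cost of splitting a word that is interrupted by an inline tag.
    return " ".join(" ".join(parser.parts).split())
```

The returned string would go into "pageContent" and the untouched input into "html".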

When a user performs a search, I return a list of links containing highlighted fragments
from "pageContent". If a link is clicked, I want to return the associated raw HTML to
the user AND have the search keywords in it highlighted, just like a Google cached page.

Re: highlight search keywords on html page

Chris Hostetter-3

: When a user performs a search, I will return a list of links containing
: highlighted fragments
: from "pageContent". If a link is clicked, I want to return the associated
: raw html back
: to user AND have search keywords in it to be highlighted, just like google
: cached page.

I'm not really sure that Solr can help you in this case ... it only knows
about the data you give it -- if you want it to highlight the raw HTML of
the entire page, then you're going to need to store the raw HTML of the
entire page in the index.

You can still highlight pageContent with heavy fragmentation on your main
search page where you list multiple results, and then when a user picks
one, redo the search with an fq restricting to the doc they picked and
hl.fl=rawHtml and hl.fragsize=0, so you'll get the whole field highlighted
without fragmentation.
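
A minimal sketch of what that second request could look like from a client, using only Python's standard library. The base URL, the `/select` path, the unique key field name `id`, and the helper name `build_highlight_url` are assumptions for illustration; `rawHtml` is the field name used in the paragraph above and should be replaced with whatever the schema actually calls it.

```python
from urllib.parse import urlencode


def build_highlight_url(base_url, user_query, doc_id,
                        id_field="id", html_field="rawHtml"):
    """Build a URL that re-runs the user's query against a single document
    and asks for the whole HTML field back, highlighted."""
    # Escape backslashes and quotes so the id can sit inside a quoted term.
    safe_id = doc_id.replace("\\", "\\\\").replace('"', '\\"')
    params = [
        ("q", user_query),                      # same query as the result list
        ("fq", f'{id_field}:"{safe_id}"'),      # restrict to the chosen doc
        ("fl", id_field),                       # the body comes from highlighting
        ("hl", "true"),
        ("hl.fl", html_field),                  # field to highlight
        ("hl.fragsize", "0"),                   # 0 = do not fragment the field
    ]
    return f"{base_url.rstrip('/')}/select?{urlencode(params)}"
```

For example, `build_highlight_url("http://localhost:8983/solr", "solr highlighting", "doc42")` yields a URL whose query string carries `fq=id%3A%22doc42%22` and `hl.fragsize=0`.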

-Hoss


AW: highlight search keywords on html page

Burkamp, Christian
I was thinking about the same thing. It shouldn't be too difficult to subclass SolrRequestHandler and build a special HighlightingRequestHandler that uses the built-in highlighting utils to do the job. I wonder if it's possible to get access to the HTTP request body inside a SolrRequestHandler subclass. (The raw text to be highlighted would have to be passed to Solr as the body of an HTTP request.)
Storing the raw text in the Solr index is a reasonable solution for small indexes only.
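
For comparison, here is a rough sketch of doing the highlighting in the application rather than in Solr, for cases where the raw HTML is kept outside the index. It is only an illustration: it matches the literal keywords case-insensitively in text nodes and knows nothing about the analysis (stemming, stop words) applied to the indexed field, so its matches can differ from Solr's. The names `KeywordHighlighter` and `highlight_html` are hypothetical.

```python
import re
from html.parser import HTMLParser


class KeywordHighlighter(HTMLParser):
    """Re-emits HTML, wrapping keyword matches in text nodes with <em>."""

    def __init__(self, keywords):
        # Keep entities as-is so the output stays valid HTML.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>
        words = sorted((re.escape(k) for k in keywords), key=len, reverse=True)
        self.pattern = re.compile(r"\b(" + "|".join(words) + r")\b", re.IGNORECASE)

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
        self.out.append(self.get_starttag_text())  # original tag text, untouched

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth:
            self.out.append(data)  # never touch script or style content
        else:
            self.out.append(self.pattern.sub(r"<em>\1</em>", data))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def highlight_html(raw_html, keywords):
    """Return raw_html with each keyword wrapped in <em> inside text nodes only."""
    keywords = [k for k in keywords if k]
    if not keywords:
        return raw_html
    parser = KeywordHighlighter(keywords)
    parser.feed(raw_html)
    parser.close()
    return "".join(parser.out)
```

Because only text nodes are rewritten, a keyword that also appears in an attribute value or inside a script is left alone.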

--Christian


-----Original Message-----
From: Chris Hostetter [mailto:[hidden email]]
Sent: Monday, 19 February 2007 03:00
To: [hidden email]
Subject: Re: highlight search keywords on html page



: When a user performs a search, I will return a list of links containing
: highlighted fragments
: from "pageContent". If a link is clicked, I want to return the associated
: raw html back
: to user AND have search keywords in it to be highlighted, just like google
: cached page.

I'm not really sure that Solr can help you in this case ... it only knows about the data you give it -- if you want it to highlight the raw HTML of the entire page, then you're going to need to store the raw HTML of the entire page in the index.

You can still highlight pageContent with heavy fragmentation on your main search page where you list multiple results, and then when a user picks one, redo the search with an fq restricting to the doc they picked and hl.fl=rawHtml and hl.fragsize=0, so you'll get the whole field highlighted without fragmentation.

-Hoss


Re: AW: highlight search keywords on html page

Chris Hostetter-3

: I was thinking about the same thing. It shouldn't be too difficult to
: subclass SolrRequestHandler and build a special
: HighlightingRequestHandler that uses the builtin highlighting utils to
: do the job. I wonder if it's possible to get access to the http request
: body inside a SolrRequestHandler subclass. (The raw text to be
: highlighted would have to be passed to solr as body in an http request).
: Storing the raw text in the solr index is a reasonable solution for
: small indexes only.

Actually ... there is some experimental stuff on the trunk that Ryan
contributed recently that adds a new dispatcher for executing request
handlers ... one of the perks of this dispatcher is a new concept of
"ContentStreams" that can be made available to SolrRequestHandlers either
as part of the original HTTP request (POST body or multipart/* file
uploads, depending on MIME type) or as a URL referenced in the request
params.

Take a look at the nightly build javadocs for more info about the
ContentStream interface (there is an Iterable of them in
SolrQueryRequest) ... the way to get your SolrRequestHandler to be
processed by the dispatcher is to register it with a name starting with a
slash, which dictates the URL (so instead of /solr/update?qt=/foo you would
use /solr/foo).

There are some examples in the example solrconfig ... look for /update/xml and
/debug/dump.
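
From the client side, passing the raw text as the POST body to such a handler might look like the sketch below. Everything specific in it is an assumption: no `/highlight` handler ships with Solr, so the path stands for a custom handler registered under a name starting with a slash, and the `q` parameter and the helper name `build_highlight_request` are hypothetical as well. The request is only constructed here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_highlight_request(base_url, handler_path, raw_html, user_query):
    """Build a POST request whose body is the raw HTML to be highlighted."""
    if not handler_path.startswith("/"):
        # The dispatcher described above keys on names starting with a slash.
        raise ValueError("handler_path must start with '/'")
    url = f"{base_url.rstrip('/')}{handler_path}?{urlencode({'q': user_query})}"
    return Request(
        url,
        data=raw_html.encode("utf-8"),  # would arrive as a content stream
        headers={"Content-Type": "text/html; charset=utf-8"},
        method="POST",
    )
```

Sending it would be a matter of passing the result to `urllib.request.urlopen`, once a handler that reads the content stream actually exists at that path.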



-Hoss


Re: highlight search keywords on html page

nick19701
In reply to this post by Chris Hostetter-3
Chris Hostetter wrote
I'm not really sure that Solr can help you in this case ... it only knows
about the data you give it -- if you want it to highlight the raw HTML of
the entire page, then you're going to need to store the raw HTML of the
entire page in the index.

You can still highlight pageContent with heavy fragmentation on your main
search page where you list multiple results, and then when a user picks
one, redo the search with an fq restricting to the doc they picked and
hl.fl=rawHtml and hl.fragsize=0, so you'll get the whole field highlighted
without fragmentation.

-Hoss
Thank you very much for clearing things up for me. I had the misconception that
I could only index pure text with Solr or Lucene. I don't know where I got this notion. But
as you pointed out in your first reply, with HTMLStripStandardTokenizerFactory I
can actually index HTML with Solr. This is a brand-new idea to me.