unable to figure out nutch type highlighting in solr....


Ravish Bhagdev
I have tried very hard to follow the documentation and forum posts that
explain how to return snippets with the relevant search terms
highlighted using Solr (as Nutch does with such ease).

I would be really grateful if someone could guide me through the basics.
I have made sure that the field to be highlighted is "stored" in the
index, etc. Still, I can't figure out why Solr returns the whole
document instead of a snippet.

I have tried the various highlight parameters in different
combinations, but I have no idea what's happening.  Can I test
highlighting in the supplied application using its "full search
interface" option?  If so, how?  At the moment it just returns XML with
the full document inside the field tag.

Please find attached my conf files as well....

Attachments: solrconfig.xml (27K), schema.xml (26K)

Re: unable to figure out nutch type highlighting in solr....

Mike Klaas
On 2-Oct-07, at 12:52 AM, Ravish Bhagdev wrote:

> I have tried very hard to follow the documentation and forum posts that
> explain how to return snippets with the relevant search terms
> highlighted using Solr (as Nutch does with such ease).
>
> I would be really grateful if someone could guide me through the basics.
> I have made sure that the field to be highlighted is "stored" in the
> index, etc. Still, I can't figure out why Solr returns the whole
> document instead of a snippet.
>
> I have tried the various highlight parameters in different
> combinations, but I have no idea what's happening.  Can I test
> highlighting in the supplied application using its "full search
> interface" option?  If so, how?  At the moment it just returns XML with
> the full document inside the field tag.

Note that the highlighting data is _not_ returned in the <doc>  
section of the response.  Getting the whole document back is probably  
due to asking for all fields (coupled with having stored the main  
text field).
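
To make the distinction concrete, here is a minimal Python sketch that
reads the highlighting block separately from the <doc> block. The sample
response is hand-written for illustration (the document ID "doc1" and the
field name "content" are made up); it only assumes the general shape of
Solr's XML response, where snippets come back in a
<lst name="highlighting"> element that follows <result>.

```python
import xml.etree.ElementTree as ET

# Hand-written sample response, for illustration only.
SAMPLE_RESPONSE = """<response>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="id">doc1</str>
    </doc>
  </result>
  <lst name="highlighting">
    <lst name="doc1">
      <arr name="content">
        <str>returns a &lt;em&gt;snippet&lt;/em&gt; instead of the document</str>
      </arr>
    </lst>
  </lst>
</response>"""


def extract_snippets(response_xml):
    """Return {doc_id: {field: [snippet, ...]}} from the highlighting block."""
    root = ET.fromstring(response_xml)
    snippets = {}
    block = root.find("lst[@name='highlighting']")
    if block is None:  # no highlighting section in this response
        return snippets
    for doc in block.findall("lst"):
        fields = {}
        for arr in doc.findall("arr"):
            fields[arr.get("name")] = [s.text for s in arr.findall("str")]
        snippets[doc.get("name")] = fields
    return snippets


snippets = extract_snippets(SAMPLE_RESPONSE)
print(snippets["doc1"]["content"][0])
```

The snippets are keyed by document ID, so a client has to join them back
to the entries in <result> itself.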

You can play with the highlighting in the admin UI.  A few parameters
are exposed there directly; the others can be appended to the URL for
testing.

The minimum required for highlighting is:
  1. hl=true
  2. hl.fl=myfield

_If_ that field matches one of the query terms, you should see  
snippets in the generated response.  Even if not, you should see a  
<highlighting> section of the response (it will be empty).
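
As a rough sketch of such a test URL, the Python below assembles the two
required parameters plus a restricted field list. The host, port, path,
the field name "content", and the assumption that the unique key field
is called "id" are all placeholders, not taken from the attached
configuration.

```python
from urllib.parse import urlencode

# Assumed location of a local Solr instance; adjust to your setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def highlight_query_url(query, field, extra=None):
    """Build a select URL with the minimum highlighting parameters."""
    params = {
        "q": query,
        "fl": "id",      # keep the <doc> section small: id only (assumed key name)
        "hl": "true",    # switch highlighting on
        "hl.fl": field,  # field to build snippets from
    }
    if extra:
        params.update(extra)
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = highlight_query_url("content:nutch", "content", {"hl.snippets": "2"})
print(url)
```

Limiting fl is what keeps the whole stored document out of the <doc>
section; the snippets still arrive in the separate highlighting section.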

regards,
-Mike

Re: unable to figure out nutch type highlighting in solr....

Mike Klaas
In reply to this post by Ravish Bhagdev

On 2-Oct-07, at 12:52 AM, Ravish Bhagdev wrote:

> <schema.xml>

I see that you're using the HTML analyzer.  Unfortunately that does  
not play very well with highlighting at the moment. You may get  
garbled output.

-Mike

Re: unable to figure out nutch type highlighting in solr....

Adrian Sutton Symphonious
> I see that you're using the HTML analyzer.  Unfortunately that does  
> not play very well with highlighting at the moment. You may get  
> garbled output.

Is it the HTML analyzer or the fact that it's HTML content? If it's  
just the analyzer you could always just copy the HTML content to  
another field with a different analyzer and use that for highlighting  
(but search on the original field). Would this work, and if so, which  
analyzer would be suitable for the second field?

Adrian Sutton
http://www.symphonious.net

Re: unable to figure out nutch type highlighting in solr....

Mike Klaas
On 4-Oct-07, at 3:19 PM, Adrian Sutton wrote:

>> I see that you're using the HTML analyzer.  Unfortunately that  
>> does not play very well with highlighting at the moment. You may  
>> get garbled output.
>
> Is it the HTML analyzer or the fact that it's HTML content? If it's  
> just the analyzer you could always just copy the HTML content to  
> another field with a different analyzer and use that for  
> highlighting (but search on the original field). Would this work,  
> and if so, which analyzer would be suitable for the second field?

The HTML analyzer strips HTML but doesn't update the offsets nicely
(the highlighter uses these to determine where to insert the <em> tags).

If you use a "normal" analyzer (like WordDelimiterFilter) directly on
the HTML, the offsets will be correct but you will get HTML tags
returned in your output, which you will have to be careful to strip
(which means you couldn't use the default '<em>' as highlighting
markers).
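
A small Python illustration of why the offsets matter. This is not
Solr's highlighter, only a sketch of the general failure mode: offsets
measured against tag-stripped text are applied to the stored text that
still contains the tags. The helper name wrap and the sample strings are
invented for the example.

```python
def wrap(text, start, end, pre="<em>", post="</em>"):
    """Insert highlight markers around text[start:end]."""
    return text[:start] + pre + text[start:end] + post + text[end:]


raw = "<b>Solr</b> rocks"   # what is stored
stripped = "Solr rocks"     # what a tag-stripping tokenizer sees

# Offsets of the token "rocks", measured in the stripped text.
start = stripped.index("rocks")
end = start + len("rocks")

good = wrap(stripped, start, end)  # offsets match the text they came from
bad = wrap(raw, start, end)        # same offsets applied to the stored HTML

print(good)  # Solr <em>rocks</em>
print(bad)   # <b>So<em>lr</b</em>> rocks
```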

In general, I don't recommend indexing HTML content straight to  
Solr.  None of the Solr contributors do this so the use case hasn't  
received a lot of love.

I'm actually somewhat surprised that several people are interested in  
this but none have been sufficiently interested to implement a  
solution to contribute:

http://issues.apache.org/jira/browse/SOLR-42

-Mike

Re: unable to figure out nutch type highlighting in solr....

Adrian Sutton Symphonious
On 05/10/2007, at 8:45 AM, Mike Klaas wrote:
> In general, I don't recommend indexing HTML content straight to  
> Solr.  None of the Solr contributors do this so the use case hasn't  
> received a lot of love.

We're indexing XHTML straight to Solr and it's working great so far.

> I'm actually somewhat surprised that several people are interested  
> in this but none have been sufficiently interested to  
> implement a solution to contribute:
>
> http://issues.apache.org/jira/browse/SOLR-42

Didn't know there was a problem to solve. We're a fair way off  
actually playing with highlighting but I'll keep an eye on this for  
when we get to it.

> -Mike

Thanks,

Adrian Sutton
http://www.symphonious.net


Re: unable to figure out nutch type highlighting in solr....

hossman
In reply to this post by Mike Klaas

: In general, I don't recommend indexing HTML content straight to Solr.  None of
: the Solr contributors do this so the use case hasn't received a lot of love.

I second that comment ... the HTML stripping code was never intended to be
an "HTML Parser"; it was designed to be a workaround for dealing with
"dirty data" where people had unwanted HTML tags in what should be plain
text.  Indexing as-is with some analyzers would result in words like
"script", "strong", and "class" matching lots of docs where the words
never really appear in the text.

If you have well-formed HTML documents, use an HTML parser to extract the
real content.
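
One way to do that extraction before the data ever reaches Solr,
sketched with Python's standard html.parser module. The class name
TextExtractor and the decision to drop script and style content are
choices made for this example; a production pipeline would likely use a
more forgiving parser.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, ignoring tags and script/style bodies."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities in text
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space, then collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = ('<p class="intro"><strong>Solr</strong> highlights&nbsp;text.</p>'
        '<script>var x = "script";</script>')
print(html_to_text(page))
```

Because only text nodes are kept, words such as "strong" and "class"
never reach the index. Joining chunks with a space avoids gluing
adjacent block elements together, at the cost of splitting a word that
has inline markup in the middle of it.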



-Hoss


Re: unable to figure out nutch type highlighting in solr....

Walter Underwood, Netflix
Wow, well-formed HTML. That's a rare beast. --wunder

On 10/4/07 7:08 PM, "Chris Hostetter" <[hidden email]> wrote:

> If you have well-formed HTML documents, use an HTML parser to extract the
> real content.


Re: unable to figure out nutch type highlighting in solr....

jjlarrea
In reply to this post by Mike Klaas
At 3:45 PM -0700 10/4/07, Mike Klaas wrote:
>I'm actually somewhat surprised that several people are interested in this but none have been sufficiently interested to implement a solution to contribute:
>
>http://issues.apache.org/jira/browse/SOLR-42

I just devised a workaround earlier in the week and was planning on posting it; thanks to your nudge I just did (to SOLR-42).  Hopefully it may be of use to someone else.

It uses a PatternTokenizerFactory with a RegEx that swallows runs of HTML- or XML-like tags:

  (?:\s*</?\w+((\s+\w+(\s*=\s*(?:"?&"'.?'|[^'">\s]+))?)\s*|\s*)/?>\s*)|\s

and it will treat runs of "things that look like HTML/XML open or close tags with optional attributes, optionally preceded or followed by spaces" identically to "runs of one or more spaces" as token delimiters, and swallow them up, so the previous and following tokens have the correct offsets.

Of course this is just a hack: It doesn't have any real understanding of HTML or XML syntax, so something invalid like </closing attr="x"/> will get matched. On the other hand, < and > in text will be left alone.

Also note it doesn't decode XML or HTML numeric or symbolic entity references, as HTMLStripReader does (my indexer is pre-decoding the entity references before sending the text to Solr for indexing).
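
The idea can be sketched outside Solr in a few lines of Python. The
delimiter pattern below is a deliberately simplified stand-in written
for this illustration, not the regex posted above, and the function name
tokenize_with_offsets is made up. What it shows is that when tag-like
runs are consumed as delimiters, every token keeps offsets that point
into the original marked-up text.

```python
import re

# Simplified delimiter: a run of tag-like spans (with surrounding spaces)
# or plain whitespace. Assumption for illustration only.
DELIMITER = re.compile(r"(?:\s*</?\w+[^<>]*>\s*)+|\s+")


def tokenize_with_offsets(text):
    """Yield (token, start, end); offsets index into the original text."""
    pos = 0
    for match in DELIMITER.finditer(text):
        if match.start() > pos:
            yield text[pos:match.start()], pos, match.start()
        pos = match.end()
    if pos < len(text):
        yield text[pos:], pos, len(text)


raw = '<p>Visit <a href="/fr">Paris</a> in 3 < 5 days</p>'
tokens = list(tokenize_with_offsets(raw))
for token, start, end in tokens:
    print(token, start, end)
```

The lone "<" in the text survives as a token, while the tags vanish. As
with the original, this is a hack with no real understanding of markup.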

So fixing HTMLStripReader and its dependent HTMLStripXXXTokenizers to do the right thing with offsets would still be a worthy task.  I wonder whether recasting HTMLStripReader using the org.apache.lucene.analysis.standard.CharStream interface would make sense for this?

(I just added the above to the Jira comment, please pardon the redundancy)

- J.J.

Re: unable to figure out nutch type highlighting in solr....

Ravish Bhagdev
In reply to this post by hossman
Thanks all for help.

Just to make sure I understand correctly, am I right in summarizing
it this way?

No significance to using HTML: unlike Nutch, Solr doesn't parse HTML,
so it ignores the anchors, titles, etc. and is not good for
PageRank-esque indexing.

HTMLAnalyser (by which you probably mean HTMLStripWhitespaceTokenizer?):
its main purpose is to allow users to index HTML code. It strips the
HTML tags and indexes the contents, but if it is used for getting
snippets in results, the <em> tags may end up in the wrong locations.

To avoid using HTMLAnalyser, strip out the tags yourself and only send
text to Solr for indexing using one of the "normal" analysers.
Highlighting should be accurate in this case.

(Query esp. Adrian):

If you are indexing XHTML, do you replace tags with entities before
giving it to Solr?  If so, when you get back snippets, do you get tags
or entities, or do you convert back to tags for presentation?  What's
the best approach?  It would help me a lot if you could briefly explain
your configuration.

Do let me know if my assumptions are wrong!

Cheers,
Ravish


Re: unable to figure out nutch type highlighting in solr....

Adrian Sutton Symphonious
On 05/10/2007, at 4:07 PM, Ravish Bhagdev wrote:
> (Query esp. Adrian):
>
> If you are indexing XHTML, do you replace tags with entities before
> giving it to Solr?  If so, when you get back snippets, do you get tags
> or entities, or do you convert back to tags for presentation?  What's
> the best approach?  It would help me a lot if you could briefly explain
> your configuration.

We happen to develop an HTML editor, so we know 100% for certain that  
the XHTML is valid XML. Given that, we just throw the raw XHTML at  
Solr, which uses the HTMLStripWhitespaceTokenizer. However, at this  
stage we haven't configured highlighting at all, so our index is used  
for search and retrieving a document ID. At some point I'd like to  
add highlighting and it sounds like the best way to do so would be to  
index the document text instead of the HTML.

Beyond that, we also use Solr as an optimization for extracting  
information such as what content was most recently changed, which  
pages link to others etc. On the page linking, we actually identify  
what pages are linked to prior to indexing and store them as a  
separate field - Solr itself has no understanding of the linking.

Oh and I should note, I'm very new to Solr so I'm probably not doing  
things the best way, but I'm getting great results anyway.

Regards,

Adrian Sutton
http://www.symphonious.net


Re: unable to figure out nutch type highlighting in solr....

Ravish Bhagdev
Thanks Adrian, I'm very new to Solr myself, so I'm struggling a bit in
the initial stages...

One last question: when you send HTML to Solr, do you also replace
special chars and tags with named entities?  I did this, and
HTMLStripper doesn't seem to recognise the tags :-S  But if I try to
input the HTML as-is, the indexer throws exceptions (as having tags
within XML tags is obviously not valid).  How should I do this part?

Ravish


Re: unable to figure out nutch type highlighting in solr....

Adrian Sutton Symphonious
> One last question: when you send HTML to Solr, do you also replace
> special chars and tags with named entities?  I did this, and
> HTMLStripper doesn't seem to recognise the tags :-S  But if I try to
> input the HTML as-is, the indexer throws exceptions (as having tags
> within XML tags is obviously not valid).  How should I do this part?

We didn't do anything at all to the HTML. The editor returns valid  
XHTML (using numeric entities, never named entities, which aren't  
valid in XML and don't tend to work in XHTML), and we do string  
concatenation to build up the /update request body like:

requestBody += "<str name=\"content\">" + xhtmlContent + "</str>";

Solr seems to handle it. From what people are suggesting though you'd  
be better off converting to plain text before indexing it with Solr.  
Something like JTidy (http://jtidy.sf.net) can parse most HTML that's  
around and you can iterate over the DOM to extract the text from there.
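
For HTML that is not guaranteed to be well-formed XML, one option is to
escape the markup when building the request, so the XML parser on the
receiving side hands the original HTML string to the field. The Python
below is a sketch of that, assuming the standard <add><doc><field>
update message; the function name build_add_request and the field names
"id" and "content" are placeholders.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def build_add_request(doc_id, html):
    """Build an <add> body; escape() turns &, < and > into entities."""
    return "".join([
        "<add><doc>",
        '<field name="id">', escape(doc_id), "</field>",
        '<field name="content">', escape(html), "</field>",
        "</doc></add>",
    ])


html = "<p>Fish &amp; chips<br>today</p>"  # <br> is unclosed: not valid XML
body = build_add_request("doc1", html)

# Parsing the body back shows what an XML parser on the receiving end sees.
fields = {f.get("name"): f.text for f in ET.fromstring(body).iter("field")}
print(fields["content"])
```

The field value round-trips to the original HTML, tags included, so
whatever tokenizer is configured for the field still sees real markup.
Escaping must happen exactly once; escaping twice would leave literal
"&lt;" sequences in the field value.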

Regards,

Adrian Sutton
http://www.symphonious.net

Re: unable to figure out nutch type highlighting in solr....

Walter Underwood, Netflix
In reply to this post by jjlarrea
That is one seriously manly regex, but I'd recommend using the Tag Soup
parser instead:

  http://ccil.org/~cowan/XML/tagsoup/

wunder

On 10/4/07 10:11 PM, "J.J. Larrea" <[hidden email]> wrote:

> It uses a PatternTokenizerFactory with a RegEx that swallows runs of HTML- or
> XML-like tags:
>
>   (?:\s*</?\w+((\s+\w+(\s*=\s*(?:"?&"'.?'|[^'">\s]+))?)\s*|\s*)/?>\s*)|\s


Re: unable to figure out nutch type highlighting in solr....

steve_rowe
In reply to this post by Adrian Sutton Symphonious
Adrian Sutton wrote:
> We didn't do anything at all to the HTML, the editor returns valid XHTML
> (using numeric entities, never named entities which aren't valid in XML
> and don't tend to work in XHTML) [...]

Named entity references are valid in XML.  They just need to be declared
before they are used[1], unless they are one of the builtin named
entities &lt; &gt; &apos; &quot; or &amp; -- these are always valid when
parsing with an XML parser.

XHTML is XML, so if parsed by an XML parser, XML's builtin named
entities are available, and if the parser doesn't ignore external
entities, then the same set of (roughly) 250 named entities defined in
HTML are available as well[2].

Steve

[1] XML well-formedness constraint - entities must be declared:
<http://www.w3.org/TR/xml/#wf-entdeclared>

[2] Named entities defined in XHTML 1.0
<http://www.w3.org/TR/xhtml1/dtds.html#h-A2>
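
The rule is easy to check with any XML parser that does not fetch
external DTDs. The sketch below uses Python's xml.etree.ElementTree as
an example of such a parser; the helper name parses and the sample
documents are invented for the illustration.

```python
import xml.etree.ElementTree as ET


def parses(xml_text):
    """Return True if the text parses as XML, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Built-in named entities: always available.
builtin = parses("<p>a &lt; b &amp; c</p>")
# Numeric character reference: needs no declaration.
numeric = parses("<p>non&#160;breaking</p>")
# HTML named entity with no declaration in sight: rejected.
undeclared = parses("<p>non&nbsp;breaking</p>")
# Same entity, declared in the internal DTD subset before use: accepted.
declared = parses(
    '<!DOCTYPE p [<!ENTITY nbsp "&#160;">]><p>non&nbsp;breaking</p>'
)

print(builtin, numeric, undeclared, declared)
```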

Re: unable to figure out nutch type highlighting in solr....

jjlarrea
In reply to this post by Adrian Sutton Symphonious
At 9:32 PM +1000 10/5/07, Adrian Sutton wrote:
>From what people are suggesting though you'd be better off converting to plain text before indexing it with Solr. Something like JTidy (http://jtidy.sf.net) can parse most HTML that's around and you can iterate over the DOM to extract the text from there.

It depends entirely on the use-case.  You can fire HTML or XML at a Solr field (possibly wrapping it in a CDATA block as just suggested by Pieter Berkel) and have it stored verbatim, then what happens at index-time is entirely dependent on the Analyzer chain: Treat tags and attributes as if they were text, remove them entirely, etc.  You can strip the markup before sending the data and so store and/or index just the text content.  You can use XSLT or other means to extract data to be indexed in specific fields.  And, as Benoit Pauwels just wrote, a combination of these techniques might be the most appropriate for a particular application, e.g. field-specific search yielding marked-up documents.

The HTMLStripXXX tokenizers appear to do a fine job of entity conversion and tag stripping, and so if highlighting is not a consideration then it makes the markup stripping very convenient, allowing storage of the document with markup and indexing of just the text content.

The primary issue with HTMLStripXXX is for the use-case when one wants to return the stored HTML/XML content with highlighting markup inserted around the text content, but preserving the original markup.  For example, have
    <topic type="location">Paris</topic>
highlighted as
    <topic type="location"><span class="highlighted">Paris</span></topic>

For that the original marked-up version (rather than stripped) must be stored, a markup-stripped version should probably (but not necessarily) be indexed, and the offsets of the indexed tokens must properly point to the locations of those tokens in the stored version.  The HTMLStripXXX tokenizers ignore the offset of the stripped content (both tags and attributes, but also when entities are converted to characters) and so the token /paris/ in the example above is given the offset of the opening <, and the highlighting falls within (and thus destroys) the <topic > tag.  The PatternTokenizer workaround posted to SOLR-42 will fulfill this use-case.
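
A sketch of what "offsets that properly point into the stored version"
means, in Python rather than inside a Solr tokenizer. While stripping
tags it records, for every character it keeps, that character's
position in the original string, and then uses the map to place the
highlight markers. The tag pattern is crude, entities are not decoded,
and the function names are invented for this example.

```python
import re

TAG = re.compile(r"<[^<>]+>")  # crude tag matcher, assumption for illustration


def strip_with_offsets(raw):
    """Return (text, offsets) where offsets[i] is the index in raw of text[i]."""
    chars, offsets, pos = [], [], 0
    for match in TAG.finditer(raw):
        for i in range(pos, match.start()):
            chars.append(raw[i])
            offsets.append(i)
        pos = match.end()
    for i in range(pos, len(raw)):
        chars.append(raw[i])
        offsets.append(i)
    return "".join(chars), offsets


def highlight_in_raw(raw, term, pre='<span class="highlighted">', post="</span>"):
    """Wrap the first occurrence of term, located via the stripped text."""
    text, offsets = strip_with_offsets(raw)
    i = text.find(term)
    if i < 0:
        return raw
    start = offsets[i]                      # map back into the stored markup
    end = offsets[i + len(term) - 1] + 1
    return raw[:start] + pre + raw[start:end] + post + raw[end:]


raw = '<topic type="location">Paris</topic>'
print(highlight_in_raw(raw, "Paris"))
```

With the map in place the markers land around the text and the <topic>
tag is left intact. A term that spans an inline tag would still need
extra care.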

But a different use-case might be for the highlighting to encompass the markup rather than just the text, e.g.
    <span class="highlighted"><topic type="location">Paris</topic></span>
which would have to be accomplished some other way.

- J.J.

Re: unable to figure out nutch type highlighting in solr....

Ravish Bhagdev
In reply to this post by Adrian Sutton Symphonious
Thanks all for the very valuable contributions; I understand these aspects
of Solr much better now

but...

> But a different use-case might be for the highlighting to encompass
> the markup rather than just the text, e.g.
>    <span class="highlighted"><topic type="location">Paris</topic></span>
> which would have to be accomplished some other way.

Yes, exactly.  And I think Nutch handles this somehow, as I remember
using it for indexing HTML and then returning snippets with accurate
highlighting placed within HTML snippets.

Is there potential for code reuse from Nutch?  Maybe this is a topic
for the Solr developer list?  Or has it already been considered?

Bests,
Ravish

Re: unable to figure out nutch type highlighting in solr....

Mike Klaas
On 5-Oct-07, at 11:59 AM, Ravish Bhagdev wrote:

>> But a different use-case might be for the highlighting to encompass
>> the markup rather than just the text, e.g.
>>   <span class="highlighted"><topic type="location">Paris</topic></span>
>> which would have to be accomplished some other way.
>
> Yes, exactly.  And I think Nutch handles this somehow, as I remember
> using it for indexing HTML and then returning snippets with accurate
> highlighting placed within HTML snippets.
>
> Is there potential for code reuse from Nutch?  Maybe this is a topic
> for the Solr developer list?  Or has it already been considered?

Last time I looked at the Nutch highlighter, I don't remember seeing  
anything about handling this correctly (which would involve a  
considerable amount of HTML finagling to get perfect).

Also, I don't see the use case for web docs: you absolutely never  
want to serve up the raw HTML from an unknown page.

I'm not against improving Solr's handling of HTML data, but it is the  
type of thing that is unlikely to happen unless someone who cares  
about it steps up.

Patches welcome :)

-Mike

Re: unable to figure out nutch type highlighting in solr....

Adrian Sutton Symphonious
In reply to this post by steve_rowe
> Named entity references are valid in XML.  They just need to be declared
> before they are used[1], unless they are one of the builtin named
> entities &lt; &gt; &apos; &quot; or &amp; -- these are always valid when
> parsing with an XML parser.

Correct, it was an offhand comment and I skipped over all the  
details. In general named entities other than the built-ins aren't  
declared at the top of the file and many parsers don't bother to read  
in external DTDs so any entities declared there aren't read and are  
therefore considered invalid.

> XHTML is XML, so if parsed by an XML parser, XML's builtin named
> entities are available, and if the parser doesn't ignore external
> entities, then the same set of (roughly) 250 named entities defined in
> HTML are available as well[2].

Except that no browser that I know of actually reads in the XHTML DTD  
when in standards-compliant mode, so none of those entities are  
actually viable to be used unless you include the declarations for  
them at the top of every XHTML document (which is ludicrous).

The bottom line is that it's far, far better to use numeric entities  
in XML and simply ignore all but the built-in named entities if you  
want to have any confidence that the document will be parsed  
correctly - hence my offhand comment.

Regards,

Adrian Sutton
http://www.symphonious.net