problems with PhraseHighlighter

iorixxx
Hello everyone,

I am having problems highlighting the complete text of a field. I have a field named xml, and I run proximity queries against it.

xml:  ( proximity1 AND/OR proximity2 AND/OR …)

The results are returned correctly and satisfy the proximity query. However, when I request highlighting, it sometimes returns nothing and sometimes returns highlighting with some of the proximity terms missing.

I set my maxFieldLength to Integer.MAX_VALUE in solrconfig.xml.
<maxFieldLength>2147483647</maxFieldLength>

I am using these highlighting parameters:

hl.maxAnalyzedChars=2147483647
hl.fragsize=2147483647
hl.usePhraseHighlighter=true
hl.requireFieldMatch=true
hl.fl=xml
hl=true
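
For anyone who wants to reproduce the request, the parameters above can be assembled into a select URL roughly like this. It is only a sketch: the host, port, and handler path in SOLR_SELECT_URL are assumptions about a default local install, and build_highlight_url is a hypothetical helper name, not part of Solr.

```python
from urllib.parse import urlencode

# Assumed location of the select handler; adjust to the actual install.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_highlight_url(query, field="xml", max_chars=2147483647):
    """Build a select URL carrying the highlighting parameters listed above.

    max_chars is used for both hl.fragsize and hl.maxAnalyzedChars,
    mirroring the values in the post.
    """
    params = [
        ("q", query),
        ("hl", "true"),
        ("hl.fl", field),
        ("hl.usePhraseHighlighter", "true"),
        ("hl.requireFieldMatch", "true"),
        ("hl.fragsize", str(max_chars)),
        ("hl.maxAnalyzedChars", str(max_chars)),
    ]
    # urlencode percent-escapes the quotes and colon in the query string.
    return SOLR_SELECT_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_highlight_url('xml:"term1 term2"~5'))
```

Swapping the fragsize argument for 0 makes it easy to compare the two settings asked about below.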

I tried combinations of hl.fragsize=0 and hl.requireFieldMatch=false, but that didn’t help. When I set hl.usePhraseHighlighter=false, highlighting is returned, but every query term is highlighted.

What value of hl.fragsize should I use to highlight the complete text of a field? 0 or 2147483647?

What is the highest value that I can set for hl.maxAnalyzedChars and hl.fragsize?

I am querying and highlighting the same field. Even so, some documents match the query but no highlighting comes back. What could be the reason?

If a document matches a query, highlighting should be returned for it, right?

Any help or pointers would be really appreciated.




Re: problems with PhraseHighlighter

Avlesh Singh
Copy-paste your field definition for the field you are trying to
highlight/search on.

Cheers
Avlesh

On Sun, Nov 1, 2009 at 8:24 PM, AHMET ARSLAN <[hidden email]> wrote:

> Hello everyone,
> [...]
Re: problems with PhraseHighlighter

iorixxx
> Copy-paste your field definition for
> the field you are trying to
> highlight/search on.
>
> Cheers
> Avlesh

Thank you for your interest, Avlesh.

My field type consists mostly of custom filters and tokenizers.

<fieldType name="XMLText" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="XMLStripStandardTokenizerFactory" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_index.txt" ignoreCase="true" expand="true" />
    <filter class="CustomStemFilterFactory" protected="protwords.txt" />
    <filter class="LowerCaseFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="CustomTokenizerFactory" />
    <filter class="CustomDeasciifyFilterFactory" />
    <filter class="CustomStemFilterFactory" protected="protwords.txt" />
    <filter class="LowerCaseFilterFactory" />
  </analyzer>
</fieldType>


At first I tried solr.HTMLStripCharFilterFactory to strip the XML tags. It works fine, but when it comes to highlighting, the <em> tags are placed at incorrect positions. The same happens with solr.HTMLStripStandardTokenizerFactory. Interestingly, the <em> tags are inserted exactly one character before the actual term. So I added a new token definition to StandardTokenizer's JFlex file that recognizes XML tags and ignores them. I confirmed with some test cases that it works. It strips XML tags at the tokenizer level. I am doing this because I display the original documents with XML + XSLT, so I need to highlight the XML files themselves for display.
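
The idea of skipping tags at the tokenizer level while keeping offsets into the original text can be sketched roughly as follows. This is a simplified Python illustration, not the actual JFlex grammar or Solr's highlighter; the regular expression and both function names are made up for the example.

```python
import re

# A tag ("<...>") or a run of word characters. Listing the tag alternative
# first means an entire tag is consumed as one match and then skipped.
TOKEN_OR_TAG = re.compile(r"<[^>]*>|\w+")


def tokenize_skipping_tags(text):
    """Return (term, start, end) tuples for words outside tags.

    start and end are offsets into the ORIGINAL text, tags included,
    which is what a highlighter needs to insert markers correctly.
    """
    tokens = []
    for match in TOKEN_OR_TAG.finditer(text):
        if match.group().startswith("<"):
            continue  # ignore the tag, but offsets keep counting past it
        tokens.append((match.group().lower(), match.start(), match.end()))
    return tokens


def highlight(text, terms):
    """Wrap every occurrence of the given terms in <em> tags."""
    wanted = {term.lower() for term in terms}
    pieces = []
    last = 0
    for term, start, end in tokenize_skipping_tags(text):
        if term in wanted:
            pieces.append(text[last:start])
            pieces.append("<em>" + text[start:end] + "</em>")
            last = end
    pieces.append(text[last:])
    return "".join(pieces)


if __name__ == "__main__":
    print(highlight("<p>Hello <b>World</b></p>", ["world"]))
```

If the offsets were computed against the stripped text instead of the original, the inserted markers would drift away from the term, which is the kind of misplacement described above.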

And I am using ComplexPhraseQueryParser [1].

But I reproduced the problem with &defType=lucene&q="term1 term2"~5. I can see that term1 and term2 are within 5 terms of each other, which is why the document is returned, but the highlighting is empty. There are no XML tags (which the tokenizer strips) between those terms in the original document.
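
To list exactly which matching documents come back without highlighting, a small helper over the parsed JSON response could look like this. It is a sketch under assumptions: the request used wt=json, the uniqueKey field is called id, and docs_without_highlighting is a hypothetical name for this example.

```python
def docs_without_highlighting(response, field="xml", id_field="id"):
    """Return ids of matched documents that have no snippets for `field`.

    `response` is the parsed JSON body of a select request: hits are under
    response["response"]["docs"], and snippets are under
    response["highlighting"], keyed by the document's unique key.
    """
    highlighting = response.get("highlighting", {})
    missing = []
    for doc in response.get("response", {}).get("docs", []):
        doc_id = str(doc[id_field])
        snippets = highlighting.get(doc_id, {}).get(field, [])
        if not snippets:
            missing.append(doc_id)
    return missing


if __name__ == "__main__":
    # Hand-written sample shaped like a Solr JSON response.
    sample = {
        "response": {"numFound": 2, "docs": [{"id": "1"}, {"id": "2"}]},
        "highlighting": {"1": {"xml": ["<em>term1</em> x <em>term2</em>"]}, "2": {}},
    }
    print(docs_without_highlighting(sample))
```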

The hl.maxAnalyzedChars parameter applies to the original document, right? In my case, that would include the XML tags too.

[1] http://lucene.apache.org/java/2_9_0/api/contrib-misc/org/apache/lucene/queryParser/complexPhrase/package-summary.html