Why does highlight use the index analyzer (instead of query)?

Christian Vogler-3
Hi,

I am using Solr 1.2.0 with a custom compound word analyzer, which inserts the
decompositions into the token stream. Because I assume that when the user
queries for a compound word, he is interested only in whole-word matches, I
have it enabled only in my index analyzer chain.

However, due to a bug in the analyzer (entirely my fault), I came to realize
that when highlighting is enabled, the highlighter uses the index analyzer
chain to find the matches, instead of the query analyzer chain.

I find this curious, and I was wondering whether this is intentional, and if
so, what is the rationale for this?

Best regards
- Christian

Re: Why does highlight use the index analyzer (instead of query)?

hossman

I'm not much of a highlighter expert, but this *seems* like it was probably
intentional ... you are talking about the use case where you have a stored
field and no term positions, correct? ... so in order to highlight, the
highlighter needs to analyze the stored text to find the word positions?

The "index" analyzer is the one that is intended to be used on the text
stored in documents, while the "query" analyzer is the one intended to be
used on (shorter) query strings ... so when highlighting you use the
"query" analyzer to build up the query object and the terms to search for,
and the "index" analyzer to parse the stored field ... those two
analyzers have to be compatible/complementary for this to work, but they
have to be compatible/complementary in the exact same way for the
queries to match at all.

also: this way you get the exact same behavior even if you switch from
storing the field to using TermPositions.


...but like i said: this is just my assumption, i don't know that much
about the highlighter.
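
To make the division of labor concrete, here is a minimal sketch in Python (not Solr or Lucene code). The tokenizer, the `COMPOUNDS` table, and all function names are hypothetical stand-ins; the only thing it tries to show is that the query analyzer produces the terms to look for, while the index analyzer re-tokenizes the stored text to find where they occur.

```python
import re

# Hypothetical decomposition table standing in for a compound word splitter.
COMPOUNDS = {"handschuh": ["hand", "schuh"]}


def tokenize(text):
    """Split text into lowercased (term, start, end) tuples."""
    return [(m.group().lower(), m.start(), m.end())
            for m in re.finditer(r"\w+", text)]


def index_analyzer(text):
    """Index-time chain: emit each token plus its compound parts."""
    tokens = []
    for term, start, end in tokenize(text):
        tokens.append((term, start, end))
        offset = start
        # Assumes the parts concatenate to the whole word.
        for part in COMPOUNDS.get(term, []):
            tokens.append((part, offset, offset + len(part)))
            offset += len(part)
    return tokens


def query_analyzer(text):
    """Query-time chain: same tokenization, no decomposition."""
    return tokenize(text)


def highlight(stored_text, query):
    """Wrap every stored-text token that matches a query term in <em>."""
    # Query analyzer: which terms are we looking for?
    query_terms = {term for term, _, _ in query_analyzer(query)}
    # Index analyzer: where do those terms occur in the stored text?
    spans = sorted({(start, end)
                    for term, start, end in index_analyzer(stored_text)
                    if term in query_terms})
    out, last = [], 0
    for start, end in spans:
        if start < last:  # skip spans overlapping one already emitted
            continue
        out.append(stored_text[last:start])
        out.append("<em>" + stored_text[start:end] + "</em>")
        last = end
    out.append(stored_text[last:])
    return "".join(out)
```

In this toy model a query for "hand" highlights only the first half of "Handschuh", which is possible only because the stored text went through the index-time chain with the splitter in it.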


: I am using Solr 1.2.0 with a custom compound word analyzer, which inserts the
: decompositions into the token stream. Because I assume that when the user
: queries for a compound word, he is interested only in whole-word matches, I
: have it enabled only in my index analyzer chain.
:
: However, due to a bug in the analyzer (entirely my fault), I came to realize
: that when highlighting is enabled, the highlighter uses the index analyzer
: chain to find the matches, instead of the query analyzer chain.
:
: I find this curious, and I was wondering whether this is intentional, and if
: so, what is the rationale for this?
:
: Best regards
: - Christian
:



-Hoss


Seeing strange highlighting in multi-valued field (was: Why does highlight use the index analyzer)

Christian Vogler-3
On Wednesday 27 February 2008 03:58:14 Chris Hostetter wrote:
> I'm not much of a highlighter expert, but this *seems* like it was probably
> intentional ... you are talking about the use case where you have a stored
> field and no term positions, correct? ... so in order to highlight, the
> highlighter needs to analyze the stored text to find the word positions?

Yes, that is correct. I index and store the field, and have term positions
disabled. Your explanation makes sense, thanks.

However, to follow up, I have run into some strange highlighter behavior on
multi-valued text fields. In particular, I have a field like this:

<fieldType name="text_de" class="solr.TextField"
positionIncrementGap="100">...</fieldType>

The analyzers for indexing and query are identical, except that I put a
compound word splitter in the indexer chain. I use this in a multi-valued
category field:

<field name="category" type="text_de" indexed="true" stored="true"
multiValued="true" />

Typical values from documents are:
<arr name="category"><str>Gebärdensprache</str><str>Recht</str></arr>

where the indexed terms, after analysis, are "gebard", "sprach", and "recht",
respectively. Now, if I query for "Gebärden" (which the analyzer transforms
into "gebard"), I get matches, as expected, but the highlighter retrieves
only the match on the first token of the first field, like this:

<arr name="category"><str>&lt;em&gt;Gebärden&lt;/em&gt;</str></arr>

The fragment, snippet, and merging parameters have no effect on this behavior;
hl.requireFieldMatch is off; hl.fragmenter is gap.
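
For anyone who wants to reproduce this, the request can be put together roughly as follows. This is only a sketch: the host, port, and `/solr/select` path are assumptions about a default local install, and `build_highlight_url` is a made-up helper name. The `hl.*` parameters are the ones described above.

```python
from urllib.parse import urlencode

# Assumed location of the select handler; adjust for the actual install.
BASE_URL = "http://localhost:8983/solr/select"


def build_highlight_url(query, field="category"):
    """Build a select URL with highlighting enabled on one field."""
    params = {
        "q": f"{field}:{query}",
        "hl": "true",                     # turn highlighting on
        "hl.fl": field,                   # field to highlight
        "hl.requireFieldMatch": "false",  # "off", as described above
        "hl.fragmenter": "gap",
    }
    # urlencode percent-encodes non-ASCII characters as UTF-8.
    return BASE_URL + "?" + urlencode(params)


print(build_highlight_url("Gebärden"))
```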

What is a bit strange is that if the field has only one value, then the
highlighter retrieves the entire contents of the field; that is, if we have
indexed

<arr name="category"><str>Gebärdensprache</str></arr>

then the highlighter will show

<arr name="category"><str>&lt;em&gt;Gebärden&lt;/em&gt;sprache</str></arr>

which is the behavior that I expected, irrespective of whether the field has
one or more values.

Any idea what could be going on here?

Best regards
- Christian

Re: Seeing strange highlighting in multi-valued field (was: Why does highlight use the index analyzer)

hossman

: which is the behavior that I expected, irrespective of whether the field has
: one or more values.
:
: Any idea what could be going on here?

not really ... but like i said, i'm not really a "highlighter guy".  I
can't think of any reason why having multiple values would cause this
behavior ... does the behavior change if the "value" that matches isn't
the first one?  what if positionIncrementGap="0" ?
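
As a rough illustration of why the gap might be worth varying, here is a simplified Python model of how token positions are laid out across the values of a multi-valued field. It assumes every token advances the position by exactly one (a compound splitter may instead stack tokens at the same position), and `assign_positions` is a hypothetical name, not a Lucene API.

```python
def assign_positions(values, gap):
    """Assign a token position to every term of a multi-valued field.

    values: list of token lists, one list per field value.
    gap:    extra positions inserted between consecutive values.
    """
    positions = []
    next_pos = 0
    for i, tokens in enumerate(values):
        if i > 0:
            next_pos += gap  # jump ahead before each value after the first
        for term in tokens:
            # Simplifying assumption: each token has a position increment of 1.
            positions.append((term, next_pos))
            next_pos += 1
    return positions


values = [["gebard", "sprach"], ["recht"]]
print(assign_positions(values, gap=100))
print(assign_positions(values, gap=0))
```

Under this model only the positions of the second and later values move when the gap changes, so if the highlighting output differs between the two settings, that would point at position handling rather than at the analyzers themselves.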

either way, it seems like a bug to me ... unless someone else chimes in
with a "that's by design because..." reply, i would open a bug and attach
a small test case demonstrating the problem (which should be fairly
straightforward since it doesn't require a lot of data)



-Hoss