Highlighting values of non stored fields


mosheB
Our use case is as follows:
We are indexing free text documents. Each document contains metadata fields
(such as author, creation date...) which are kinda small, and one "big"
field that holds the document's text itself.

For ranking purposes each field is indexed in more than one "variation" and
the query is executed with the edismax query parser. Things are working alright, but
now a new feature has been requested by the customer: highlighting.
To enable highlighting every field must be stored, including all variations
of the big text field. This pushes our storage to the limit (and probably
the document cache...) and feels a bit redundant, as the stored value is
duplicated n times... Is there any way to “reference” a stored value from one
field in another?
For example:
Say we have the following config:
<dynamicField name="*_bigrams" type="bigrams" indexed="true" stored="false" />
<dynamicField name="*_phrases" type="phrases" indexed="true" stored="false" />

<field name="doc_text" type="text_general" indexed="true" stored="true" />
<copyField source="doc_text" dest="doc_text_bigrams" />
<copyField source="doc_text" dest="doc_text_phrases" />

And we execute the following query:
http://.../select?defType=edismax&q=desired_terms&qf=doc_text^2 doc_text_bigrams^3 doc_text_phrases^4&hl=on&hl.fl=doc_text,doc_text_bigrams,doc_text_phrases

The highlight fragments in the response will be blank if the match occurred on the
non-stored fields (doc_text_bigrams or doc_text_phrases). Is it possible to
pass an extra parameter to the highlight component, to point it to the stored
data of the “original” doc_text field? A kind of “stored value reference
field”?

Thanks in advance.



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: Highlighting values of non stored fields

Erick Erickson
Why do you think even variants need to be stored/highlighted? Usually
when you store variants for ranking purposes those extra copies are
invisible to the user. So most often people store exactly one copy
of a particular field and highlight _that_ field in the return.

So say my field is f1 and I have indexed f1_1, f1_2, f1_3. I just store
f1_1 and return the highlighted text from that one.

You could even just store the data only once in a field that’s never
indexed and return/highlight that if you wanted.
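
A minimal sketch of what such a request could look like, written in Python for illustration only. The field names and boosts are borrowed from the example schema earlier in this thread, and `build_select_params` is a hypothetical helper, not part of any Solr client. Whether fragments actually come back also depends on how the stored field is analyzed, as discussed later in the thread.

```python
from urllib.parse import urlencode

# Field names and boosts copied from the example in the original post.
STORED_FIELD = "doc_text"
QUERY_FIELDS = {"doc_text": 2, "doc_text_bigrams": 3, "doc_text_phrases": 4}


def build_select_params(user_query):
    """Search every variant, but request highlights only on the stored field."""
    qf = " ".join(f"{name}^{boost}" for name, boost in QUERY_FIELDS.items())
    return {
        "defType": "edismax",
        "q": user_query,
        "qf": qf,
        "hl": "on",
        # Only the single stored copy is listed here, not the variants.
        "hl.fl": STORED_FIELD,
    }


params = build_select_params("desired_terms")
query_string = urlencode(params)  # URL-encodes spaces and '^' in qf
print(query_string)
```

The resulting string would be appended to the `/select` endpoint of the collection.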

Best,
Erick

> On Jun 2, 2020, at 3:24 AM, mosheB <[hidden email]> wrote:
> [...]


Re: Highlighting values of non stored fields

mosheB

Thanks Erick for the reply. Your answer is exactly what I was expecting from the highlight component, but it seems I am getting different behaviour.
I'll try to give a simple example and I hope you can explain where my mistake is.
Say I have the following field configuration:
<field name="doc_text" type="text_ws" indexed="true" stored="true"/>
<dynamicField name="*_lw" type="text_general" indexed="true" stored="false"/>
<copyField source="doc_text" dest="doc_text_lw"/>
 
And I indexed the following document:
{
    "doc_text": "MOSH"
}
 
When I execute the query "http://.../select?q=doc_text_lw:mosh&hl=true&hl.fl=doc_text", the document is matched and returned in the response, but the highlighted fragment is empty.
I also tried changing the 'hl.method' param to 'unified' and 'fastVector', but no luck either. My conclusion was that the 'hl.fl' param should be set to 'doc_text_lw' and that this field must also be stored...
 
 
 

Sent: Tuesday, June 02, 2020 at 3:15 PM
From: "Erick Erickson" <[hidden email]>
To: [hidden email]
Subject: Re: Highlighting values of non stored fields
[...]


Re: Highlighting values of non stored fields

Erick Erickson
When highlighting, the stored data for the field is re-analyzed against the query based on the field you’re highlighting. My bet is that if you query just “q=doc_text:mosh” you will not get a hit. Check your text_ws fieldType; it’s probably case sensitive. So if you changed the doc_text type to text_general (the same as your dynamic field), I think you’d be fine. Re-index your data, of course.
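
As a rough illustration of the re-analysis point, here is a toy Python model. It is not Solr code: the two analyzer functions and the `highlight` helper are invented for this sketch and only mimic a case-sensitive whitespace analysis versus a lowercasing one.

```python
def analyze_whitespace(text):
    """Toy stand-in for a case-sensitive, whitespace-only analyzer."""
    return text.split()


def analyze_lowercase(text):
    """Toy stand-in for an analyzer that also lowercases each token."""
    return [token.lower() for token in text.split()]


def highlight(stored_text, query_terms, analyzer, pre="<em>", post="</em>"):
    """Re-analyze the stored text and wrap words whose tokens match a query term."""
    out = []
    for word in stored_text.split():
        hit = any(token in query_terms for token in analyzer(word))
        out.append(f"{pre}{word}{post}" if hit else word)
    return " ".join(out)


# The query doc_text_lw:mosh yields the lowercased term "mosh".
query_terms = {"mosh"}

# Stored value re-analyzed case-sensitively: "MOSH" never equals "mosh".
print(highlight("MOSH", query_terms, analyze_whitespace))

# Stored value re-analyzed with lowercasing: the term lines up.
print(highlight("MOSH", query_terms, analyze_lowercase))
```

In this model the first call leaves the text unmarked and the second wraps it, which mirrors the empty fragment reported above.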

I’ll add by-the-by that text_ws is fairly restricted and is rarely useful for searching on anything humans have to key in. It’ll include punctuation, for instance: input like “dog dog.” will produce two tokens, one with a period in the token and one without. It’s most useful for heavily-preprocessed data where the app normalizes the input, or for machine-generated input.
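
The “dog dog.” case can be reproduced with plain whitespace splitting in Python. This is only an analogy for whitespace tokenization, and the normalization step is a hypothetical example of the kind of preprocessing an application might apply.

```python
import string

text = "dog dog."

# Whitespace-only tokenization keeps punctuation attached to the token,
# so the two occurrences become two different terms.
tokens = text.split()
print(tokens)

# Hypothetical application-side normalization: strip surrounding
# punctuation and lowercase, so both occurrences collapse to one term.
normalized = [t.strip(string.punctuation).lower() for t in tokens]
print(normalized)
```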

There’s no reason, BTW, to index your doc_text for highlighting purposes since the stored data is what counts. Unless, of course, you want to search on that field specifically.

Best,
Erick

> On Jun 7, 2020, at 11:32 PM, mosh bla <[hidden email]> wrote:
> [...]


Re: Highlighting values of non stored fields

mosheB


Thanks Erick, indeed that was my problem, and you helped me understand how the hl component works. But I still can't see how I can avoid storing all of a field’s variations. For example, if I need to support morphological search, I have 2 fields:
<field name="doc_text" type="text_general" indexed="true" stored="true"/>
<field name="doc_text_morph" type="custom_morphological_field" indexed="true" stored="false"/>
<copyField source="doc_text" dest="doc_text_morph"/>
Say we indexed the following doc:
{
    "doc_text": "walking dead"
}
Following queries should match:
q = walking
q = walk
I am issuing an edismax query with qf="doc_text^2 doc_text_morph" (boosts are currently missing) and adding highlight params. ‘walk’ will be matched on doc_text_morph, but will be highlighted only if doc_text_morph is stored (there is no match on the stored field doc_text...). Is there any way to get it highlighted without also storing the doc_text_morph field?
Thanks again...
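
The situation described above can be modelled with a small Python sketch. Everything here is an assumption made for illustration: `toy_stem` is a deliberately naive stand-in for the custom morphological analysis, and the two analyzers are not Solr's. The sketch only shows why the terms line up on one field and not the other; it does not propose a fix.

```python
def toy_stem(token):
    """Hypothetical, naive stemmer: lowercase and strip a trailing 'ing'."""
    token = token.lower()
    if token.endswith("ing") and len(token) > 5:
        return token[:-3]
    return token


def analyze_plain(text):
    """Toy stand-in for doc_text: lowercased whitespace tokens."""
    return {t.lower() for t in text.split()}


def analyze_morph(text):
    """Toy stand-in for doc_text_morph: stemmed tokens."""
    return {toy_stem(t) for t in text.split()}


stored_value = "walking dead"


def field_matches(analyzer, query):
    """True if query terms and field terms overlap under the same analyzer."""
    return bool(analyzer(stored_value) & analyzer(query))


for query in ("walking", "walk"):
    print(
        query,
        "doc_text:", field_matches(analyze_plain, query),
        "doc_text_morph:", field_matches(analyze_morph, query),
    )
```

In this model "walk" overlaps only with the stemmed terms, so re-analyzing the stored doc_text value with the plain analysis yields nothing to mark.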
 
 
 
 

Sent: Monday, June 08, 2020 at 3:39 PM
From: "Erick Erickson" <[hidden email]>
To: [hidden email]
Subject: Re: Highlighting values of non stored fields
[...]