Highlighting large text fields

Shaun Campbell
I've been using highlighting for a while, using the original highlighter,
and have just come across a problem with fields that contain a large amount
of text, approx 250k characters. I only have about 2,000 records but each
one contains a journal publication to search through.

What I noticed is that some records didn't return a highlight even though
they matched on the content. I noticed the hl.maxAnalyzedChars parameter
and increased it. That allowed some records to be highlighted, but not
all, and then it caused memory problems on the server. Performance is also
very poor.

To try to fix this I've tried to configure the unified highlighter in my
solrconfig.xml instead. It seems to be working but again I'm missing some
highlighted records.

The other thing is I've tried to adjust my unified highlighting settings in
solrconfig.xml and they don't seem to be having any effect even after
restarting Solr. I was just wondering whether there is any highlighting
information stored at index time. It's taking over 4 hours to index my
records so it's not easy to keep reindexing my content.

Any ideas on how to handle highlighting of large content would be
appreciated.

Shaun

Re: Highlighting large text fields

David Smiley
Hello!

I worked on the UnifiedHighlighter a lot and want to help you!

On Mon, Jan 11, 2021 at 9:58 AM Shaun Campbell <[hidden email]>
wrote:

> I've been using highlighting for a while, using the original highlighter,
> and just come across a problem with fields that contain a large amount of
> text, approx 250k characters. I only have about 2,000 records but each one
> contains a journal publication to search through.
>
> What I noticed is that some records didn't return a highlight even though
> they matched on the content. I noticed the hl.maxAnalyzedChars parameter
> and increased that, but  it allowed some records to be highlighted, but not
> all, and then it caused memory problems on the server.  Performance is also
> very poor.
>

I've been thinking hl.maxAnalyzedChars should maybe default to no limit --
it's a performance threshold, but it's perhaps better to opt in to such a
limit than to scratch your head for a long time wondering why a search
result isn't showing highlights.


> To try to fix this I've tried  to configure the unified highlighter in my
> solrconfig.xml instead.   It seems to be working but again I'm missing some
> highlighted records.
>

There is no configuration of that highlighter in solrconfig.xml; it's
entirely parameter driven (runtime).


> The other thing is I've tried to adjust my unified highlighting settings in
> solrconfig.xml and they don't  seem to be having any effect even after
> restarting Solr.  I was just wondering whether there is any highlighting
> information stored at index time. It's taking over 4 hours to index my
> records so it's not easy to keep reindexing my content.
>
> Any ideas on how to handle highlighting of large content  would be
> appreciated.
>
> Shaun
>

Please read the documentation here thoroughly:
https://lucene.apache.org/solr/guide/8_6/highlighting.html#the-unified-highlighter
(or earlier version as applicable)
Since you have large bodies of text to highlight, you would strongly
benefit from putting offsets into the search index (and re-indexing) --
storeOffsetsWithPositions.  That's an option on the field/fieldType in your
schema; it may not be obvious from reading the docs.  You have to opt in to
that; Solr doesn't normally store any info in the index for highlighting.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley

Re: Highlighting large text fields

Shaun Campbell
Hi David

First of all I wanted to say I'm working off your book!!  Third edition,
and I think it's a bit out of date now. I was just going to try following
the section on the Postings highlighter, but I see that's been absorbed
into the Unified highlighter. I find your book easier to follow than the
official documentation though.

I am going to try to configure the unified highlighter, and I will add that
storeOffsetsWithPositions to the schema (which I saw in your book) and I
will try indexing again from scratch.  Was getting some funny things going
on where I thought I'd turned highlighting off and it was still giving me
highlights.

Actually just re-reading your email again, are you saying that you can't
configure highlighting in solrconfig.xml? That's where I always configure
original highlighting in my dismax search handler. Am I supposed to add
highlighting to each request?

Thanks
Shaun


Re: Highlighting large text fields

David Smiley
On Tue, Jan 12, 2021 at 9:39 AM Shaun Campbell <[hidden email]>
wrote:

> Hi David
>
> First of all I wanted to say I'm working off your book!!  Third edition,
> and I think it's a bit out of date now. I was just going to try following
> the section on the Postings highlighter, but I see that's been absorbed
> into the Unified highlighter. I find your book easier to follow than the
> official documentation though.
>

Thanks :-D.  I do maintain the Solr Reference Guide for the parts of code I
touch, including highlighting, so I hope what's there makes sense too.


> I am going to try to configure the unified highlighter, and I will add that
> storeOffsetsWithPositions to the schema (which I saw in your book) and I
> will try indexing again from scratch.  Was getting some funny things going
> on where I thought I'd turned highlighting off and it was still giving me
> highlights.
>

hl=true/false


> Actually just re-reading your email again, are you saying that you can't
> configure highlighting in solrconfig.xml? That's where I always configure
> original highlighting in my dismax search handler. Am I supposed to add
> highlighting to each request?
>

You can set highlighting and other *parameters* in solrconfig.xml for
request handlers.  But the dedicated <highlighting> plugin info is only for
the original and Fast Vector Highlighters.
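To illustrate the point about parameters, here is a minimal Python sketch of a request that selects the unified highlighter purely through query parameters. The field name `content`, the example query, and the helper name `highlight_params` are assumptions made for illustration, not taken from this thread. The same parameter names could instead be placed in a request handler's defaults in solrconfig.xml.

```python
from urllib.parse import urlencode


def highlight_params(query, field="content", enabled=True):
    """Build Solr query parameters that select the unified highlighter.

    `field` is a hypothetical field name; substitute your own.
    """
    return {
        "q": query,
        # hl=true/false switches highlighting on or off for this request
        "hl": "true" if enabled else "false",
        # choose the unified highlighter at request time
        "hl.method": "unified",
        # field to highlight
        "hl.fl": field,
    }


params = highlight_params("cancer screening")
# URL-encoded form, ready to append to a select URL after "?"
query_string = urlencode(params)
print(query_string)
```

Passing `enabled=False` produces `hl=false`, which is a quick way to confirm that the core you are querying is the one whose configuration you changed.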

~ David



Re: Highlighting large text fields

Shaun Campbell
Hi David

Getting closer now.

First of all, a bit of a mistake on my part. I have two cores set up and I
was changing the solrconfig.xml on the wrong core doh!!  That's why
highlighting wasn't being turned off.

I think I've got the unified highlighter working.
storeOffsetsWithPositions was already configured on my field type
definition, not the field definition, so that was ok.

What it boils down to now I think is hl.maxAnalyzedChars. I'm getting
highlighting on some records and not others, making it confusing as to
where the match is with my dismax parser.  I increased
my hl.maxAnalyzedChars to 1300000 and now it's highlighting more records.
Two questions:

1. Have you any guidelines as to what could be a
maximum hl.maxAnalyzedChars without impacting performance or memory?

2. Do you know a way to query the maximum length of text in a field so that
I can set hl.maxAnalyzedChars accordingly?  Just thinking I can probably
modify my Java indexer to log the maximum content length.  Actually, I
probably don't want the maximum but some value that highlights 90-95% of
records.

Thanks
Shaun


Re: Highlighting large text fields

David Smiley
On Tue, Jan 12, 2021 at 1:08 PM Shaun Campbell <[hidden email]>
wrote:

> Hi David
>
> Getting closer now.
>
> First of all, a bit of a mistake on my part. I have two cores set up and I
> was changing the solrconfig.xml on the wrong core doh!!  That's why
> highlighting wasn't being turned off.
>
> I think I've got the unified highlighter working.
> storeOffsetsWithPositions was already configured on my field type
> definition, not the field definition, so that was ok.
>
> What it boils down to now I think is hl.maxAnalyzedChars. I'm getting
> highlighting on some records and not others, making it confusing as to
> where the match is with my dismax parser.  I increased
> my hl.maxAnalyzedChars to 1300000 and now it's highlighting more records.
> Two questions:
>
> 1. Have you any guidelines as to what could be a
> maximum hl.maxAnalyzedChars without impacting performance or memory?
>

With storeOffsetsWithPositions, highlighting is super-fast, and so this
hl.maxAnalyzedChars threshold is of marginal utility; it's only useful to
cap the amount of memory used if you have some truly humongous docs and it's
okay to highlight only the first X megabytes of them.  Maybe set it to 100MB
worth of text, or something like that.
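As a rough sketch of what "100MB worth of text" could look like as a parameter value, the Python below converts megabytes to a character count. It assumes about one byte per character, which only holds for mostly-ASCII text, and the helper name `chars_for_megabytes` is made up for this example.

```python
def chars_for_megabytes(megabytes):
    """Approximate character count for `megabytes` of text.

    Assumes roughly one byte per character (mostly-ASCII text).
    """
    return megabytes * 1024 * 1024


highlight_params = {
    "hl": "true",
    "hl.method": "unified",
    # a deliberately high cap, so that it only guards against extreme documents
    "hl.maxAnalyzedChars": str(chars_for_megabytes(100)),
}
print(highlight_params["hl.maxAnalyzedChars"])
```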


> 2. Do you know a way to query the maximum length of text in a field so that
> I can set hl.maxAnalyzedChars accordingly?  Just thinking I can probably
> modify my java indexer to log the maximum content length.  Actually, I
> probably don't want the maximum but some value that highlights 90-95% of
> records.
>

Eh... not really.  Maybe some approximation hacks involving function
queries on norms, but I wouldn't bother; just use a threshold high enough
that this won't be an issue.
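If you did still want a data-driven threshold, the approach suggested in the question (recording content lengths in the indexer) could be followed by a simple percentile calculation. This is only a sketch: it assumes the lengths have already been collected into a list, the sample values are invented, and `percentile_length` is a hypothetical helper.

```python
def percentile_length(lengths, pct):
    """Return the smallest recorded length that covers at least `pct` percent
    of documents (nearest-rank method, `pct` given as an integer 1-100)."""
    if not lengths:
        raise ValueError("no lengths recorded")
    ordered = sorted(lengths)
    # ceiling of pct * n / 100 using integer arithmetic (1-based rank)
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Invented sample of per-document content lengths, in characters
sample_lengths = [
    120_000, 250_000, 90_000, 1_300_000, 310_000,
    275_000, 60_000, 240_000, 410_000, 180_000,
]

threshold_90 = percentile_length(sample_lengths, 90)
threshold_95 = percentile_length(sample_lengths, 95)
print(threshold_90, threshold_95)
```

With a skewed sample like this one, a single very long document dominates the top percentiles, which is one reason a generous fixed cap may be simpler.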

All this said, this threshold is *not* the only reason why you might not be
getting highlights that you expect.  If you are using a recent Solr
version, you might try toggling the hl.weightMatches boolean, which could
make a difference for certain query arrangements.  There's a JIRA issue
pertaining to this one, and I haven't investigated it yet.
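To compare settings (for example with hl.weightMatches on and off), it can help to list which returned documents have no snippets at all. The sketch below inspects the `highlighting` section of a Solr JSON response that has already been parsed into a Python dict. The document ids and snippets are invented sample data, and the function name is hypothetical.

```python
def docs_without_highlights(response):
    """Return ids of documents with no highlight snippets in a parsed
    Solr JSON response."""
    highlighting = response.get("highlighting", {})
    missing = []
    for doc_id, fields in highlighting.items():
        # A document counts as highlighted if any field has at least one snippet.
        if not any(snippets for snippets in fields.values()):
            missing.append(doc_id)
    return missing


# Invented sample response fragment
sample_response = {
    "highlighting": {
        "doc1": {"content": ["<em>trial</em> results were published"]},
        "doc2": {"content": []},
        "doc3": {},
    }
}

missing_ids = docs_without_highlights(sample_response)
print(missing_ids)
```

Running the same query twice with different parameter values and comparing the two lists shows whether a given setting changes which documents get highlights.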

~ David



Re: Highlighting large text fields

Shaun Campbell
That's great David.  So hl.maxAnalyzedChars isn't that critical. I'll whack
it right up and see what happens.

I'm running 7.4 from a few years ago. Should I upgrade?

For your info this is what I'm doing with Solr
https://dev.fundingawards.nihr.ac.uk/search.

Thanks
Shaun


Re: Highlighting large text fields

David Smiley
The last update to highlighting that I think is pertinent to
whether highlights match or not is v7.6, which added the hl.weightMatches
option.  So I recommend upgrading to at least that if you want to
experiment further.  But... hl.weightMatches highlights more accurately, and
as such it is likely to highlight less than you are highlighting
now, and highlighting more appears to be your goal right now.
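
A minimal sketch of how the two settings could be compared side by side: it
only builds the request URLs, one with hl.weightMatches on and one with it
off, so the highlighting sections of the two responses can be diffed by hand.
The host, the core name "journals", the field name "content" and the query
text are all hypothetical placeholders, and the 100MB cap is just the rough
figure suggested earlier in this thread, not a tuned value.

```python
from urllib.parse import urlencode


def highlight_url(base, query, field, weight_matches, max_chars=100_000_000):
    """Build a Solr /select URL asking the unified highlighter for snippets."""
    params = {
        "q": query,
        "defType": "dismax",      # the thread mentions a dismax handler
        "qf": field,
        "hl": "true",
        "hl.method": "unified",
        "hl.fl": field,
        # High cap so long documents are not silently cut off.
        "hl.maxAnalyzedChars": str(max_chars),
        # Needs a Solr version that supports this parameter.
        "hl.weightMatches": "true" if weight_matches else "false",
    }
    return f"{base}/select?{urlencode(params)}"


# Hypothetical host and core name; substitute your own.
BASE = "http://localhost:8983/solr/journals"

for flag in (True, False):
    print(highlight_url(BASE, "heart failure", "content", flag))
```

Running the same query both ways and counting which documents come back
with an empty highlighting entry should show whether this option, rather
than hl.maxAnalyzedChars, explains the missing highlights.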

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley



Re: Highlighting large text fields

Shaun Campbell
Hi David

Just reindexed everything and it appears to be performing well and giving
me highlights for the matched text.

Thanks for your help.
Shaun
