highlighting a whole html document using Unified highlighter


highlighting a whole html document using Unified highlighter

Serkan KAZANCI
Hi,

I use Solr to search over a million HTML documents. When a document is
found and displayed, I want to highlight the keywords that were used to
find it.

The Unified highlighter is fast, accurate and supports different languages,
but it only highlights passages, according to the given parameters.

How can I highlight a whole HTML document using the Unified highlighter? I
have written some PHP code, but it cannot handle complex word stemming.

Thanks,

Serkan


Re: highlighting a whole html document using Unified highlighter

Jörn Franke
hl.fragsize=0

https://lucene.apache.org/solr/guide/8_5/highlighting.html
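
For illustration, a request using this parameter might be assembled as in the sketch below. The host, the core name `htmldocs`, the field name `content` and the helper `build_highlight_url` are assumptions made up for the example; `hl`, `hl.method`, `hl.fl` and `hl.fragsize` are the highlighting parameters described in the guide linked above.

```python
from urllib.parse import urlencode

# Hypothetical host and core name; adjust to your installation.
SOLR_SELECT_URL = "http://localhost:8983/solr/htmldocs/select"


def build_highlight_url(query: str, doc_id: str, field: str = "content") -> str:
    """Build a select URL asking the Unified highlighter for the whole field."""
    params = {
        "q": query,
        "fq": f'id:"{doc_id}"',  # restrict to the document being displayed
        "hl": "true",
        "hl.method": "unified",
        "hl.fl": field,          # assumed name of the stored text field
        "hl.fragsize": "0",      # 0 = no fragmenting, use the whole field value
    }
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


print(build_highlight_url("highlighting", "doc-42"))
```

For very long documents it may also be necessary to raise `hl.maxAnalyzedChars`, since the highlighter only analyzes a limited number of characters per field by default.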



> On 24.05.2020 at 11:49, Serkan KAZANCI <[hidden email]> wrote:
>
> Hi,
>
> I use Solr to search over a million HTML documents. When a document is
> found and displayed, I want to highlight the keywords that were used to
> find it.
>
> The Unified highlighter is fast, accurate and supports different languages,
> but it only highlights passages, according to the given parameters.
>
> How can I highlight a whole HTML document using the Unified highlighter? I
> have written some PHP code, but it cannot handle complex word stemming.
>
> Thanks,
>
> Serkan

RE: highlighting a whole html document using Unified highlighter

Serkan KAZANCI
Thanks Jörn for the answer,

I use the post tool to index HTML documents, so the HTML tags are stripped when the documents are indexed and stored. The remaining text is mapped to the field content by default.

hl.fragsize=0 works perfectly for the indexed document, but I can only display a highlighted text-only version of the HTML document, because the HTML tags have been stripped.

So is it possible to index and store the HTML document without stripping the HTML tags, so that when the document is displayed with the hl.fragsize=0 parameter, it is displayed as the original HTML document?

Or

Is it possible to give a whole HTML document as a parameter to the Unified highlighter, so that the output is also a highlighted HTML document?

Or

Do you have a better idea for highlighting the keywords in the whole HTML document?

Thanks,

Serkan

-----Original Message-----
From: Jörn Franke [mailto:[hidden email]]
Sent: Sunday, May 24, 2020 1:22 PM
To: [hidden email]
Subject: Re: highlighting a whole html document using Unified highlighter

hl.fragsize=0

https://lucene.apache.org/solr/guide/8_5/highlighting.html




Re: highlighting a whole html document using Unified highlighter

David Smiley
Instead of stripping the HTML for the stored value, leave it be and remove
it during the analysis stage with solr.HTMLStripCharFilterFactory
<https://builds.apache.org/job/Solr-reference-guide-master/javadoc/charfilterfactories.html#solr-htmlstripcharfilterfactory>
This means the searchable text will only be the visible text, basically.
And the highlighter will only highlight what's searchable.
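
As a rough sketch of what such a setup could look like, the snippet below builds two Schema API commands: a field type whose analyzer starts with `solr.HTMLStripCharFilterFactory`, and a stored field of that type. The names `text_html` and `html_content`, the helper `schema_commands`, and the choice of tokenizer and lowercase filter are assumptions for illustration, not something prescribed in this thread.

```python
import json


def schema_commands(type_name: str = "text_html",
                    field_name: str = "html_content") -> list:
    """Return Schema API commands for an HTML-stripping text field.

    The type and field names are hypothetical placeholders.
    """
    add_field_type = {
        "add-field-type": {
            "name": type_name,
            "class": "solr.TextField",
            "analyzer": {
                # Markup is removed only for analysis; the stored value keeps it.
                "charFilters": [{"class": "solr.HTMLStripCharFilterFactory"}],
                "tokenizer": {"class": "solr.StandardTokenizerFactory"},
                "filters": [{"class": "solr.LowerCaseFilterFactory"}],
            },
        }
    }
    add_field = {
        "add-field": {
            "name": field_name,
            "type": type_name,
            "indexed": True,
            "stored": True,  # the original HTML is what gets highlighted
        }
    }
    return [add_field_type, add_field]


for command in schema_commands():
    print(json.dumps(command, indent=2))
```

Each command would be POSTed as JSON to the core's schema endpoint. Language-specific filters such as a stemmer would go into the `filters` list after the lowercase filter.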

I suggest doing some experimentation for searching for words that you know
are directly adjacent (no spaces) to opening and closing tags to make sure
that the inserted HTML markup for the highlight balances correctly.  Use a
"phrase query" (quoted) as well, and see if you can highlight around markup
like "phrase</p>query" to see what happens.  You might need to set
hl.weightMatches=false to ensure the words separately are highlighted.  I
suspect you will find there is a problem, and the root cause is here:
LUCENE-5734 <https://issues.apache.org/jira/browse/LUCENE-5734>   It's on
my long TODO list but hasn't bitten me lately so I've neglected it.
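
One way to run the experiment described above is to check the highlighted output mechanically. The sketch below is only an assumption about how that might be done: it uses Python's standard `html.parser` to test whether every closing tag matches the innermost open tag, so a highlight tag that straddles a `</p>` shows up as a failure. The names `NestingChecker` and `is_well_nested` are made up for this example, and the sample strings assume the highlighter wraps matches in `<em>` tags.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class NestingChecker(HTMLParser):
    """Records the first closing tag that does not match the innermost open tag."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.first_error = None

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        pass  # self-closing tags such as <br/> open and close at once

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.first_error is not None:
            return
        if not self.stack or self.stack[-1] != tag:
            self.first_error = tag
        else:
            self.stack.pop()


def is_well_nested(html: str) -> bool:
    """True if all tags are closed in the reverse order they were opened."""
    checker = NestingChecker()
    checker.feed(html)
    checker.close()
    return checker.first_error is None and not checker.stack


print(is_well_nested("<p>a <em>phrase</em></p><p><em>query</em></p>"))  # True
print(is_well_nested("<p>a <em>phrase</p><p>query</em></p>"))           # False
```

The check is deliberately strict: real-world HTML that omits optional closing tags (for example `<li>` or `<p>`) would also be reported as unbalanced, so it is best run on small test documents.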

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Sun, May 24, 2020 at 7:20 AM Serkan KAZANCI <[hidden email]>
wrote:

> Thanks Jörn for the answer,
>
> I use post tool to index html documents, so the html tags are stripped
> when indexed and stored. The remaining text is mapped to the field content
> by default.
>
> hl.fragsize=0 works perfect for the indexed document, but I can only
> display highlighted text-only version of html document because the html
> tags are stripped.
>
> So is it possible to index and store the html document without stripping
> the html tags, so that when the document is displayed with hl.fragsize=0
> parameter, it is displayed as original html document?
>
> Or
>
> Is it possible to give a whole html document as a parameter to the Unified
> highlighter so that output is also a highlighted html document?
>
> Or
>
> Do you have a better idea to highlight the keywords of the whole html
> document?
>
>  Thanks,
>
>  Serkan
>

RE: highlighting a whole html document using Unified highlighter

Serkan KAZANCI
Hi David,

I have many meta tags in the HTML documents, like <meta name="tarih" content="2019-10-15T23:59:59Z">, which match the fields described in the schema file.

As I understand it, you propose indexing the whole HTML document as one text file and mapping it to a search field (do you?). That would take care of the HTML highlighting issue; however, I would lose the field information coming from the meta tags.

So is it possible to index the HTML document as an HTML document (preserving the field data coming from the meta tags and not stripping the HTML tags)?

Then I could use solr.HTMLStripCharFilterFactory for analysis.

Thank you,

Serkan




-----Original Message-----
From: David Smiley [mailto:[hidden email]]
Sent: Sunday, May 24, 2020 5:26 PM
To: solr-user
Subject: Re: highlighting a whole html document using Unified highlighter

Instead of stripping the HTML for the stored value, leave it be and remove
it during the analysis stage with solr.HTMLStripCharFilterFactory
<https://builds.apache.org/job/Solr-reference-guide-master/javadoc/charfilterfactories.html#solr-htmlstripcharfilterfactory>
This means the searchable text will only be the visible text, basically.
And the highlighter will only highlight what's searchable.

I suggest doing some experimentation for searching for words that you know
are directly adjacent (no spaces) to opening and closing tags to make sure
that the inserted HTML markup for the highlight balance correctly.  Use a
"phrase query" (quoted) as well, and see if you can highlight around markup
like "phrase</p>query" to see what happens.  You might need to set
hl.weightMatches=false to ensure the words separately are highlighted.  I
suspect you will find there is a problem, and the root cause is here:
LUCENE-5734 <https://issues.apache.org/jira/browse/LUCENE-5734>   It's on
my long TODO list but hasn't bitten me lately so I've neglected it.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley



Re: highlighting a whole html document using Unified highlighter

david.w.smiley@gmail.com
These strategies are not mutually exclusive.  Yes I do suggest having the
HTML in whole go into one searchable field to satisfy your highlighting
use-case.  But I can imagine you will also want some document metadata in
separate fields.  It's up to you to parse that out somehow and add it.  You
mentioned you are using bin/post but, IMO, that capability is more for
quick experimentation / tutorials, some POCs, or very simple use-cases.  I
doubt you can do what I suggest while still using bin/post.  You might be
able to use "SolrCell" AKA ExtractingRequestHandler directly, which is what
bin/post does with HTML.
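
As one possible way to "parse that out", the sketch below collects `<meta name=... content=...>` pairs with Python's standard `html.parser` and combines them with the untouched HTML into a single document. The field name `html_content` and the helpers `MetaCollector` and `build_solr_doc` are assumptions for illustration; it also assumes, as described earlier in the thread, that the meta names match field names in the schema.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects <meta name="..." content="..."> pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.meta[name] = content


def build_solr_doc(doc_id: str, html: str,
                   html_field: str = "html_content") -> dict:
    """Combine meta-tag fields with the raw HTML (field name is hypothetical)."""
    collector = MetaCollector()
    collector.feed(html)
    collector.close()
    doc = dict(collector.meta)  # e.g. {"tarih": "2019-10-15T23:59:59Z"}
    doc["id"] = doc_id
    doc[html_field] = html      # markup kept intact for display and highlighting
    return doc


sample = ('<html><head><meta name="tarih" content="2019-10-15T23:59:59Z">'
          '</head><body><p>Hello</p></body></html>')
print(build_solr_doc("doc-1", sample))
```

Documents built this way could then be sent as a JSON array to the core's update handler instead of going through bin/post.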

Good luck!

~ David


On Sun, May 24, 2020 at 10:52 AM Serkan KAZANCI <[hidden email]>
wrote:

> Hi David,
>
> I have many meta-tags in html documents like  <meta name="tarih"
> content="2019-10-15T23:59:59Z"> which matches the field descriptions in
> schema file.
>
> As I understand, you propose to index the whole html document as one text
> file and map it to a search field (do you?) . That would take care of the
> html highlight issue, however I would lose the field information coming
> from meta-tags .
>
> So is it possible to index the html document as html document ?
> (preserving the field data coming from meta-tags and not strip the html
> tags)
>
> Then I could use solr.HTMLStripCharFilterFactory for analysis.
>
> Thank You,
>
> Serkan,
>
>
>
>

Re: highlighting a whole html document using Unified highlighter

Serkan KAZANCI
All clear.

Thanks David,

> On 24 May 2020, at 18:57, David Smiley <[hidden email]> wrote:
>
> These strategies are not mutually exclusive.  Yes I do suggest having the
> HTML in whole go into one searchable field to satisfy your highlighting
> use-case.  But I can imagine you will also want some document metadata in
> separate fields.  It's up to you to parse that out somehow and add it.  You
> mentioned you are using bin/post but, IMO, that capability is more for
> quick experimentation / tutorials, some POCs, or very simple use-cases.  I
> doubt you can do what I suggest while still using bin/post.  You might be
> able to use "SolrCell" AKA ExtractingRequestHandler directly, which is what
> bin/post does with HTML.
>
> Good luck!
>
> ~ David
>
>

Re: highlighting a whole html document using Unified highlighter

Jörn Franke
In reply to this post by Serkan KAZANCI
Hmm maybe more insights on the use case would be useful. It looks like what David says about metadata could make sense in your scenario depending on the requirements...



> On 24.05.2020 at 13:20, Serkan KAZANCI <[hidden email]> wrote:
>
> Thanks Jörn for the answer,
>
> I use post tool to index html documents, so the html tags are stripped when indexed and stored. The remaining text is mapped to the field content by default.
>
> hl.fragsize=0 works perfect for the indexed document, but I can only display highlighted text-only version of html document because the html tags are stripped.
>
> So is it possible to index and store the html document without stripping the html tags, so that when the document is displayed with hl.fragsize=0 parameter, it is displayed as original html document?
>
> Or
>
> Is it possible to give a whole html document as a parameter to the Unified highlighter so that output is also a highlighted html document?
>
> Or
>
> Do you have a better idea to highlight the keywords of the whole html document?
>
> Thanks,
>
> Serkan
>