how to present html content in browse


how to present html content in browse

srini
I am indexing records from a database using DIH. The content of each record is in HTML format. When I use browse,
I would like to show the content as HTML, not as plain text. Any ideas?

Re: how to present html content in browse

Lance Norskog-2
Make two fields: one that holds the stripped HTML for searching and
another that stores the raw HTML for display. You can use <copyField>
so that you do not have to submit the HTML page twice.

You would mark the stripped field 'indexed=true stored=false' and the
full text field the other way around. The full text field should be a
String type.

On Thu, May 3, 2012 at 1:04 PM, srini <[hidden email]> wrote:
> I am indexing records from database using DIH. The content of my record is in
> html format. When I use browse
> I would like to show the content in html format, not in text format. Any
> ideas?
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/how-to-present-html-content-in-browse-tp3960327.html
> Sent from the Solr - User mailing list archive at Nabble.com.



--
Lance Norskog
[hidden email]

Re: how to present html content in browse

okayndc
Hello,

I'm having a hard time understanding this, and I had this same question.

When using DIH should the HTML field be stored in the raw HTML string field
or the stripped field?
Also what source field(s) need to be copied and to what destination?

Thanks


On Thu, May 3, 2012 at 10:15 PM, Lance Norskog <[hidden email]> wrote:

> Make two fields, one with stores the stripped HTML and another that
> stores the parsed HTML. You can use <copyField> so that you do not
> have to submit the html page twice.
>
> You would mark the stripped field 'indexed=true stored=false' and the
> full text field the other way around. The full text field should be a
> String type.

Re: how to present html content in browse

Jack Krupansky-2
1. The raw html field (call it, "text_html") would be a "string" type field
that is "stored" but not "indexed". This is the field you direct DIH to
output to. This is the field you would return in your search results with
the HTML to be displayed.

2. The stripped field (call it "text_stripped") would be a "text" type
field (where "text" is a field type you add that uses the HTML strip char
filter, as shown below) that is not "stored" but is "indexed". Add a
<copyField> to your schema that copies from the raw HTML field to the
stripped field (say, "text_html" to "text_stripped").
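
As a rough sketch of points 1 and 2, the schema.xml entries might look like the fragment below. The field names are the ones suggested above, and type="text" assumes a field type with the HTML strip char filter has been defined under that name; adjust both to your own schema.

```xml
<!-- schema.xml fragment (illustrative sketch, not a tested configuration) -->

<!-- Raw HTML: returned in search results for display, not searched -->
<field name="text_html" type="string" indexed="false" stored="true"/>

<!-- Stripped text: searched, never returned.
     Assumes "text" is a field type that uses HTMLStripCharFilterFactory -->
<field name="text_stripped" type="text" indexed="true" stored="false"/>

<!-- Copy the raw value into the stripped field at index time;
     the HTML is removed by the destination field's analyzer -->
<copyField source="text_html" dest="text_stripped"/>
```

Because copyField copies the original submitted value, the stripping happens only when the destination field is analyzed, so the document has to supply the HTML once.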

For reference on HTML strip (HTMLStripCharFilterFactory), see:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

Which has:

<fieldtype name="text" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldtype>

However, you might want to call that field type "text_stripped" to avoid
confusion with a simple text field.

You can add HTMLStripCharFilterFactory to some other field type that you
might want to use, but this "charFilter" needs to be before the "tokenizer".
The "text" field type above is just an example.
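
On the DIH side, the mapping could be sketched roughly as follows. The JDBC driver, URL, table name ("records"), and column names ("id", "body_html") are hypothetical placeholders, not details from this thread; only the target field name "text_html" comes from the suggestion above.

```xml
<!-- data-config.xml fragment (illustrative sketch; database details are assumptions) -->
<dataConfig>
  <!-- Hypothetical JDBC connection; substitute your own driver, URL, and credentials -->
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/mydb"
              user="solr"
              password="changeme"/>
  <document>
    <!-- Hypothetical table and columns -->
    <entity name="record" query="SELECT id, body_html FROM records">
      <field column="id" name="id"/>
      <!-- Direct the HTML column to the raw, stored-only field -->
      <field column="body_html" name="text_html"/>
    </entity>
  </document>
</dataConfig>
```

DIH only needs to populate the raw field; the stripped field is filled by the copyField rule in the schema.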

-- Jack Krupansky

-----Original Message-----
From: okayndc
Sent: Friday, May 04, 2012 1:01 PM
To: [hidden email]
Subject: Re: how to present html content in browse

Hello,

I'm having a hard time understanding this, and I had this same question.

When using DIH should the HTML field be stored in the raw HTML string field
or the stripped field?
Also what source field(s) need to be copied and to what destination?

Thanks



Re: how to present html content in browse

okayndc
Is it possible to return the HTML field highlighted?

On Fri, May 4, 2012 at 1:27 PM, Jack Krupansky <[hidden email]> wrote:

> 1. The raw html field (call it, "text_html") would be a "string" type
> field that is "stored" but not "indexed". This is the field you direct DIH
> to output to. This is the field you would return in your search results
> with the HTML to be displayed.

Re: how to present html content in browse

Jack Krupansky-2
Evidently there was a problem with highlighting of HTML that is supposedly
fixed in Solr 3.6 and trunk:

https://issues.apache.org/jira/browse/SOLR-42

-- Jack Krupansky

-----Original Message-----
From: okayndc
Sent: Friday, May 04, 2012 4:35 PM
To: [hidden email]
Subject: Re: how to present html content in browse

Is it possible to return the HTML field highlighted?


Re: how to present html content in browse

okayndc
Okay, thanks for the info.

On Fri, May 4, 2012 at 4:42 PM, Jack Krupansky <[hidden email]> wrote:

> Evidently there was a problem with highlighting of HTML that is supposedly
> fixed in Solr 3.6 and trunk:
>
> https://issues.apache.org/jira/browse/SOLR-42

Re: how to present html content in browse

Lance Norskog-2
You need positions and offsets to do highlighting. A CharFilter does
not preserve positions.

I think you have to analyze the raw HTML with a different analyzer, in
addition to the stripping one. I think this is how it works: define a new
analyzer stack with the standard tokenizer plus the lower case filter,
stemmer, synonyms, etc., and give the raw HTML field that text type.
You then search on the stripped field, but highlight from the raw
field with 'hl.fl'.

Here's the cool part: you do not actually need to index the raw HTML,
only store it. If you do not index a field, the Highlighter analyzes
the HTML when it needs the positions and offsets.
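
A rough, untested sketch of that setup is below. The field type name "text_html_raw" and the "/select" handler defaults are assumptions made for illustration; "text_html" and "text_stripped" are the field names used earlier in the thread. Whether highlighting on a stored-only field behaves as described may depend on your Solr version and highlighter.

```xml
<!-- schema.xml: hypothetical field type for the raw HTML field.
     No HTML strip char filter here; the tokenizer and filters mirror the
     stripped field so that highlighted terms line up with query terms -->
<fieldType name="text_html_raw" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Raw HTML is stored but not indexed; it is analyzed only when highlighting -->
<field name="text_html" type="text_html_raw" indexed="false" stored="true"/>

<!-- solrconfig.xml: search the stripped field, highlight from the raw field -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="df">text_stripped</str>
    <str name="hl">true</str>
    <str name="hl.fl">text_html</str>
  </lst>
</requestHandler>
```

The same parameters can be passed on the request instead of as handler defaults, for example hl=true&hl.fl=text_html.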

On Fri, May 4, 2012 at 2:25 PM, okayndc <[hidden email]> wrote:

> Okay, thanks for the info.



--
Lance Norskog
[hidden email]