How does HTMLStripWhitespaceTokenizerFactory work?

Thierry Collogne
Hello,

I am trying to use the solr.HTMLStripWhitespaceTokenizerFactory analyzer
with no luck.

I have a field named content that contains the following:

<field name="content"><![CDATA[test      <a href="test">link</a>
                                 post]]></field>

When I run a search, I get the following:

<result name="response" numFound="1" start="0">
 <doc>
  <str name="content">test      &lt;a href="test"&gt;link&lt;/a&gt;
                              post</str>

  <str name="id">po_1_NL</str>
  <str name="keywords">post</str>
  <str name="titlesearch">This is a test</str>
 </doc>
</result>


Is this normal? Shouldn't the html code and the white spaces be removed from
the field?

This is my config in schema.xml:

<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
      </analyzer>
 </fieldType>

<field name="content" type="text_ws" indexed="true" stored="true" omitNorms="false"/>

Can someone help me with this?

Re: How does HTMLStripWhitespaceTokenizerFactory work?

Yonik Seeley-2
On 6/8/07, Thierry Collogne <[hidden email]> wrote:
> I am trying to use the solr.HTMLStripWhitespaceTokenizerFactory analyzer
> with no luck.
[...]
> Is this normal? Shouldn't the html code and the white spaces be removed from
> the field?

For indexing purposes, yes.  The stored field you get back will be
unchanged though.
If you want to see what will be indexed, try the analysis debugger in
the admin pages.

-Yonik
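
To make the indexed-versus-stored distinction concrete, here is a toy model in Python. It is only a sketch of the idea and not how Solr is implemented: the `analyze` and `add_document` helpers, the regex-based tag removal, and the dictionaries are all hypothetical names and simplifications made up for this illustration.

```python
import re

def analyze(text):
    """Toy stand-in for an HTML-stripping whitespace tokenizer (not Solr's code)."""
    # Crude tag removal, good enough for this illustration only.
    without_tags = re.sub(r"<[^>]+>", " ", text)
    # Split on runs of whitespace, so the extra spaces and newlines vanish.
    return without_tags.split()

def add_document(index, stored, doc_id, text):
    # The stored value is kept exactly as it was submitted.
    stored[doc_id] = text
    # Only the tokens produced by analysis go into the inverted index.
    for token in analyze(text):
        index.setdefault(token, set()).add(doc_id)

index, stored = {}, {}
add_document(index, stored, "po_1_NL",
             'test      <a href="test">link</a>\n      post')

print(sorted(index))      # the searchable terms
print(stored["po_1_NL"])  # the value a search result would return
```

In this model a search for "link" finds the document through the index, while the value returned to the client is the untouched original, markup and whitespace included.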

Re: How does HTMLStripWhitespaceTokenizerFactory work?

Thierry Collogne
Ok. Is it possible to get back the content without the html tags?


Re: How does HTMLStripWhitespaceTokenizerFactory work?

Mike Klaas
On 11-Jun-07, at 3:54 AM, Thierry Collogne wrote:

> Ok. Is it possible to get back the content without the html tags?
>

Well, the stripped text isn't stored anywhere in Solr.  It's best to
think of Lucene/Solr as two systems: the indexer applies a tokenization
transformation to the data and creates an inverted index; the storage
system keeps the data exactly as you gave it, _before_
analysis/tokenization.  If there is analysis you'd like applied to the
stored value of the doc as well, it's probably easier to apply it
before passing the data to Solr.

-Mike
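
One way to do that client-side clean-up, sketched in Python with only the standard library. This is an assumption about what such a pre-processing step might look like, not something prescribed in this thread; `strip_html` and `_TextExtractor` are hypothetical names, and how the cleaned string is then sent to Solr is left out.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects the text nodes of a document and ignores the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &amp; and friends
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags. Note that the contents of
        # <script> and <style> elements also arrive here.
        self.parts.append(data)

def strip_html(markup):
    """Return the text of `markup` with tags removed and whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered after the last tag
    return " ".join("".join(parser.parts).split())

clean = strip_html('test      <a href="test">link</a>\n      post')
print(clean)
```

The resulting string is what would go into the content field of the document posted to Solr, so the stored value and the indexed terms are both free of markup.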


Re: How does HTMLStripWhitespaceTokenizerFactory work?

Chris Hostetter-3

: Ok. Is it possible to get back the content without the html tags?

Solr never does anything to modify the "stored" value of a field, so you'd
really need to send Solr the value after stripping the HTML to get this to
work.

Internally, the HTMLStripWhitespaceTokenizerFactory does the HTML
stripping as part of the tokenization process, so there is never a
single markup-free value for the field in Solr.

-Hoss


Re: How does HTMLStripWhitespaceTokenizerFactory work?

Thierry Collogne
Ok. Thanks for the clarification. We will do the stripping before the
indexing.
