HTMLStripCharFilterFactory not working in Solr4?

Mike Hugo
We recently updated to the latest build of Solr4 and everything is working
really well so far!  There is one case that is not working the same way it
was in Solr 3.4 - we strip out certain HTML constructs (like trademark and
registered, for example) in a field as defined below - it was working in
Solr3.4 with the configuration shown here, but is not working the same way
in Solr4.

The label field is defined as type="text_general"
<field name="label" type="text_general" indexed="true" stored="false"
required="false" multiValued="true"/>

Here's the type definition for text_general field:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>


In Solr 3.4, that configuration completely stripped HTML constructs out of
the indexed field, which is exactly what we wanted.  In Solr4, if we facet
on the label field, as in the test below, the response contains some terms
that we would not like to be there.


// test case (groovy)
void specialHtmlConstructsGetStripped() {
    SolrInputDocument inputDocument = new SolrInputDocument()
    inputDocument.addField('label', 'Bose&#174; &#8482;')

    solrServer.add(inputDocument)
    solrServer.commit()

    QueryResponse response = solrServer.query(new SolrQuery('bose'))
    assert 1 == response.results.numFound

    SolrQuery facetQuery = new SolrQuery('bose')
    facetQuery.facet = true
    facetQuery.set(FacetParams.FACET_FIELD, 'label')
    facetQuery.set(FacetParams.FACET_MINCOUNT, '1')

    response = solrServer.query(facetQuery)
    FacetField ff = response.facetFields.find {it.name == 'label'}

    List suggestResponse = []

    for (FacetField.Count facetField in ff?.values) {
        suggestResponse << facetField.name
    }

    assert suggestResponse == ['bose']
}

With the upgrade to Solr4, the assertion fails: the suggested response
contains 174 and 8482 as terms.  Test output is:

Assertion failed:

assert suggestResponse == ['bose']
       |               |
       |               false
       [174, 8482, bose]


I just tried again using the latest build from today, namely:
https://builds.apache.org/job/Lucene-Solr-Maven-trunk/369/ and we're still
getting the failing assertion. Is there a different way to configure the
HTMLStripCharFilterFactory in Solr4?

Thanks in advance for any tips!

Mike

Re: HTMLStripCharFilterFactory not working in Solr4?

Yonik Seeley-2-2
You can use LegacyHTMLStripCharFilterFactory to get the previous behavior.
See https://issues.apache.org/jira/browse/LUCENE-3690 for more details.

-Yonik
http://www.lucidimagination.com




Re: HTMLStripCharFilterFactory not working in Solr4?

Mike Hugo
Thanks for the response, Yonik.
Interestingly enough, changing to the LegacyHTMLStripCharFilterFactory
does NOT solve the problem; in fact, I get the same result.

I can see that the LegacyHTMLStripCharFilterFactory is being applied at
startup:

Jan 24, 2012 1:25:29 PM org.apache.solr.util.plugin.AbstractPluginLoader load
INFO: created : org.apache.solr.analysis.LegacyHTMLStripCharFilterFactory

however, I'm still getting the same assertion error.  Any thoughts?

Mike



RE: HTMLStripCharFilterFactory not working in Solr4?

steve_rowe
In reply to this post by Mike Hugo
Hi Mike,

When I add the following test to TestHTMLStripCharFilterFactory.java on Solr trunk, it passes:
 
public void testNumericCharacterEntities() throws Exception {
  final String text = "Bose&#174; &#8482;";  // |Bose® ™|
  HTMLStripCharFilterFactory htmlStripFactory = new HTMLStripCharFilterFactory();
  htmlStripFactory.init(Collections.<String,String>emptyMap());
  CharStream charStream = htmlStripFactory.create(CharReader.get(new StringReader(text)));
  StandardTokenizerFactory stdTokFactory = new StandardTokenizerFactory();
  stdTokFactory.init(DEFAULT_VERSION_PARAM);
  Tokenizer stream = stdTokFactory.create(charStream);
  assertTokenStreamContents(stream, new String[] { "Bose" });
}

What's happening:

First, htmlStripFactory converts "&#174;" to "®" and "&#8482;" to "™".  Then stdTokFactory declines to tokenize "®" and "™" because they belong to the Unicode general category "Symbol, Other", so they are not included in any of the output tokens.

StandardTokenizer uses the Word Break rules from UAX#29 <http://unicode.org/reports/tr29/> to find token boundaries, and then outputs only alphanumeric tokens.  See the JFlex grammar for details: <http://svn.apache.org/viewvc/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex?view=markup>.

The behavior you're seeing is not consistent with the above test.

Steve
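
The two stages described above can be sketched in plain Java, without Lucene on the classpath. This is only an illustration under simplifying assumptions: the class name EntityThenTokenizeSketch and both helper methods are hypothetical, the "char filter" stand-in only decodes decimal character references, and the "tokenizer" stand-in keeps runs of letters and digits rather than implementing the UAX#29 rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: plain-Java stand-ins for the two analysis stages, not Lucene's code.
final class EntityThenTokenizeSketch {

    private static final Pattern NUMERIC_ENTITY = Pattern.compile("&#(\\d+);");

    // Stage 1 (stand-in for the char filter): decode decimal character references.
    static String decodeNumericEntities(String text) {
        Matcher m = NUMERIC_ENTITY.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String decoded = new String(Character.toChars(Integer.parseInt(m.group(1))));
            m.appendReplacement(out, Matcher.quoteReplacement(decoded));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Stage 2 (stand-in for the tokenizer): keep only runs of letters and digits.
    static List<String> alphanumericTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String decoded = decodeNumericEntities("Bose&#174; &#8482;");
        System.out.println(decoded);                                               // prints Bose® ™
        System.out.println(Character.getType('\u00AE') == Character.OTHER_SYMBOL); // prints true
        System.out.println(alphanumericTokens(decoded));                           // prints [Bose]
    }
}
```

Once the entities are decoded, the symbols are neither letters nor digits, so nothing numeric is left for the tokenizer to emit.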


RE: HTMLStripCharFilterFactory not working in Solr4?

Michael Ryan
Try putting the HTMLStripCharFilterFactory before the StandardTokenizerFactory instead of after it. I vaguely recall being burned by something like this before.

-Michael

Re: HTMLStripCharFilterFactory not working in Solr4?

Yonik Seeley-2-2
In reply to this post by Mike Hugo
Oops, I didn't read carefully enough to see that you wanted those constructs
entirely stripped out.

Given that you're seeing numbers indexed, this strongly indicates an
escaping bug in the SolrJ client that must have been introduced at
some point.
I'll see if I can reproduce it in a unit test.


-Yonik
http://www.lucidimagination.com

Re: HTMLStripCharFilterFactory not working in Solr4?

Mike Hugo
In reply to this post by steve_rowe
Thanks for the responses everyone.

Steve, the test method you provided also works for me.  However, when I try
a more end-to-end test with the HTMLStripCharFilterFactory configured for a
field, I still have the same problem.  I attached a failing unit test
and configuration to the following issue in JIRA:

https://issues.apache.org/jira/browse/LUCENE-3721

I appreciate all the prompt responses!  Looking forward to finding the root
cause of this guy :)  If there's something I'm doing incorrectly in the
configuration, please let me know!

Mike


RE: HTMLStripCharFilterFactory not working in Solr4?

steve_rowe
Hi Mike,

Yonik committed a fix to Solr trunk - your test on LUCENE-3721 succeeds for me now.  (On Solr trunk, *all* CharFilters have been non-functional since LUCENE-3396 was committed in r1175297 on 25 Sept 2011, until Yonik's fix today in r1235810; Solr 3.x was not affected - CharFilters have been working there all along.)

Steve
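
If the char filter is skipped entirely, the raw field value reaches the tokenizer with its entities undecoded, which would account for the facet terms reported at the start of the thread. The following is a rough sketch of that situation in plain Java, not Lucene code: the class SkippedCharFilterSketch and its method are hypothetical, and a regex split plus lowercasing stands in for StandardTokenizer and LowerCaseFilter.

```java
import java.util.Locale;
import java.util.SortedSet;
import java.util.TreeSet;

// Illustrative only: simplified stand-ins for the tokenizer and lowercase filter, not Lucene's code.
final class SkippedCharFilterSketch {

    // Split on anything that is not a letter or decimal digit, lowercase each run,
    // and collect the distinct terms in sorted order.
    static SortedSet<String> indexedTerms(String rawFieldValue) {
        SortedSet<String> terms = new TreeSet<>();
        for (String run : rawFieldValue.split("[^\\p{L}\\p{Nd}]+")) {
            if (!run.isEmpty()) {
                terms.add(run.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // No entity decoding happens first, so the digits inside "&#174;" and
        // "&#8482;" survive as tokens of their own.
        System.out.println(indexedTerms("Bose&#174; &#8482;")); // prints [174, 8482, bose]
    }
}
```

The printed terms match the failing assertion output in the original report, which fits a char filter that was never applied rather than one that decoded the entities differently.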


Re: HTMLStripCharFilterFactory not working in Solr4?

Mike Hugo
Thanks guys!  I'll grab the latest build from the solr4 jenkins server when
those commits get picked up and try it out.  Thanks for the quick
turnaround!

Mike

On Wed, Jan 25, 2012 at 11:01 AM, Steven A Rowe <[hidden email]> wrote:

> Hi Mike,
>
> Yonik committed a fix to Solr trunk - your test on LUCENE-3721 succeeds
> for me now.  (On Solr trunk, *all* CharFilters have been non-functional
> since LUCENE-3396 was committed in r1175297 on 25 Sept 2011, until Yonik's
> fix today in r1235810; Solr 3.x was not affected - CharFilters have been
> working there all along.)
>
> Steve
>
> > -----Original Message-----
> > From: Mike Hugo [mailto:[hidden email]]
> > Sent: Tuesday, January 24, 2012 3:56 PM
> > To: [hidden email]
> > Subject: Re: HTMLStripCharFilterFactory not working in Solr4?
> >
> > Thanks for the responses everyone.
> >
> > Steve, the test method you provided also works for me.  However, when I
> > try
> > a more end to end test with the HTMLStripCharFilterFactory configured for
> > a
> > field I am still having the same problem.  I attached a failing unit test
> > and configuration to the following issue in JIRA:
> >
> > https://issues.apache.org/jira/browse/LUCENE-3721
> >
> > I appreciate all the prompt responses!  Looking forward to finding the
> > root
> > cause of this guy :)  If there's something I'm doing incorrectly in the
> > configuration, please let me know!
> >
> > Mike
> >
> > On Tue, Jan 24, 2012 at 1:57 PM, Steven A Rowe <[hidden email]> wrote:
> >
> > > Hi Mike,
> > >
> > > When I add the following test to TestHTMLStripCharFilterFactory.java on
> > > Solr trunk, it passes:
> > >
> > > public void testNumericCharacterEntities() throws Exception {
> > >  final String text = "Bose&#174; &#8482;";  // |Bose® ™|
> > >  HTMLStripCharFilterFactory htmlStripFactory = new HTMLStripCharFilterFactory();
> > >  htmlStripFactory.init(Collections.<String,String>emptyMap());
> > >  CharStream charStream = htmlStripFactory.create(CharReader.get(new StringReader(text)));
> > >  StandardTokenizerFactory stdTokFactory = new StandardTokenizerFactory();
> > >  stdTokFactory.init(DEFAULT_VERSION_PARAM);
> > >  Tokenizer stream = stdTokFactory.create(charStream);
> > >  assertTokenStreamContents(stream, new String[] { "Bose" });
> > > }
> > >
> > > What's happening:
> > >
> > > First, htmlStripFactory converts "&#174;" to "®" and "&#8482;" to "™".
> > >  Then stdTokFactory declines to tokenize "®" and "™", because they
> > > belong to the Unicode general category "Symbol, Other", and so are not
> > > included in any of the output tokens.
> > >
> > > StandardTokenizer uses the Word Break rules from UAX#29
> > > <http://unicode.org/reports/tr29/> to find token boundaries, and then
> > > outputs only alphanumeric tokens.  See the JFlex grammar for details:
> > > <http://svn.apache.org/viewvc/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex?view=markup>.
> > >
> > > The behavior you're seeing is not consistent with the above test.
> > >
> > > Steve
> > >
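
For illustration, the two steps described above can be sketched in plain Java without any Lucene classes. This is a simplified stand-in, not the actual HTMLStripCharFilter or StandardTokenizer code: the names EntityStripSketch, decodeNumericEntities and alphanumericTokens are hypothetical, only decimal numeric entities are handled, and tokenizing is approximated by keeping runs of letters and digits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not part of Solr or Lucene.
class EntityStripSketch {
    // Decimal numeric character entities such as &#174; (at most 6 digits).
    private static final Pattern NUMERIC_ENTITY = Pattern.compile("&#(\\d{1,6});");

    // Step 1 (char filter stand-in): replace each entity with its character.
    public static String decodeNumericEntities(String text) {
        Matcher m = NUMERIC_ENTITY.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(text, last, m.start());
            out.appendCodePoint(Integer.parseInt(m.group(1)));
            last = m.end();
        }
        out.append(text, last, text.length());
        return out.toString();
    }

    // Step 2 (tokenizer stand-in): emit runs of letters and digits only.
    // Symbols such as ® and ™ are neither, so they never reach a token.
    public static List<String> alphanumericTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String raw = "Bose&#174; &#8482;";
        // Char filter applied before tokenizing: prints [Bose]
        System.out.println(alphanumericTokens(decodeNumericEntities(raw)));
        // Char filter skipped: prints [Bose, 174, 8482]
        System.out.println(alphanumericTokens(raw));
    }
}
```

When the decode step is skipped, the digits inside the entities survive as separate tokens. That matches the 174 and 8482 terms in the failing facet assertion, where bose appears lowercased because of the LowerCaseFilterFactory in the field type.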
> > > > -----Original Message-----
> > > > From: Mike Hugo [mailto:[hidden email]]
> > > > Sent: Tuesday, January 24, 2012 1:34 PM
> > > > To: [hidden email]
> > > > Subject: HTMLStripCharFilterFactory not working in Solr4?
> > > >
> > > > We recently updated to the latest build of Solr4 and everything is working
> > > > really well so far!  There is one case that is not working the same way it
> > > > was in Solr 3.4 - we strip out certain HTML constructs (like trademark and
> > > > registered, for example) in a field as defined below - it was working in
> > > > Solr3.4 with the configuration shown here, but is not working the same way
> > > > in Solr4.
> > > >
> > > > The label field is defined as type="text_general"
> > > > <field name="label" type="text_general" indexed="true" stored="false"
> > > > required="false" multiValued="true"/>
> > > >
> > > > Here's the type definition for text_general field:
> > > > <fieldType name="text_general" class="solr.TextField"
> > > > positionIncrementGap="100">
> > > >             <analyzer type="index">
> > > >                 <tokenizer class="solr.StandardTokenizerFactory"/>
> > > >                 <charFilter class="solr.HTMLStripCharFilterFactory"/>
> > > >                 <filter class="solr.StopFilterFactory" ignoreCase="true"
> > > >                         words="stopwords.txt"
> > > >                         enablePositionIncrements="true"/>
> > > >                 <filter class="solr.LowerCaseFilterFactory"/>
> > > >             </analyzer>
> > > >             <analyzer type="query">
> > > >                 <tokenizer class="solr.StandardTokenizerFactory"/>
> > > >                 <charFilter class="solr.HTMLStripCharFilterFactory"/>
> > > >                 <filter class="solr.StopFilterFactory" ignoreCase="true"
> > > >                         words="stopwords.txt"
> > > >                         enablePositionIncrements="true"/>
> > > >                 <filter class="solr.LowerCaseFilterFactory"/>
> > > >             </analyzer>
> > > >         </fieldType>
> > > >
> > > >
> > > > In Solr 3.4, that configuration was completely stripping html constructs
> > > > out of the indexed field which is exactly what we wanted.  If for example,
> > > > we then do a facet on the label field, like in the test below, we're
> > > > getting some terms in the response that we would not like to be there.
> > > >
> > > >
> > > > // test case (groovy)
> > > > void specialHtmlConstructsGetStripped() {
> > > >     SolrInputDocument inputDocument = new SolrInputDocument()
> > > >     inputDocument.addField('label', 'Bose&#174; &#8482;')
> > > >
> > > >     solrServer.add(inputDocument)
> > > >     solrServer.commit()
> > > >
> > > >     QueryResponse response = solrServer.query(new SolrQuery('bose'))
> > > >     assert 1 == response.results.numFound
> > > >
> > > >     SolrQuery facetQuery = new SolrQuery('bose')
> > > >     facetQuery.facet = true
> > > >     facetQuery.set(FacetParams.FACET_FIELD, 'label')
> > > >     facetQuery.set(FacetParams.FACET_MINCOUNT, '1')
> > > >
> > > >     response = solrServer.query(facetQuery)
> > > >     FacetField ff = response.facetFields.find {it.name == 'label'}
> > > >
> > > >     List suggestResponse = []
> > > >
> > > >     for (FacetField.Count facetField in ff?.values) {
> > > >         suggestResponse << facetField.name
> > > >     }
> > > >
> > > >     assert suggestResponse == ['bose']
> > > > }
> > > >
> > > > With the upgrade to Solr4, the assertion fails, the suggested response
> > > > contains 174 and 8482 as terms.  Test output is:
> > > >
> > > > Assertion failed:
> > > >
> > > > assert suggestResponse == ['bose']
> > > >        |               |
> > > >        |               false
> > > >        [174, 8482, bose]
> > > >
> > > >
> > > > I just tried again using the latest build from today, namely:
> > > > https://builds.apache.org/job/Lucene-Solr-Maven-trunk/369/ and we're still
> > > > getting the failing assertion. Is there a different way to configure the
> > > > HTMLStripCharFilterFactory in Solr4?
> > > >
> > > > Thanks in advance for any tips!
> > > >
> > > > Mike
> > >
>