char filter factory and tokeniser issue in admin Analysis form


lee carroll
Hi,

On Solr 4.7 I've run into a strange issue. Whilst setting up a field, I
noticed in the Analysis form that when I use a char filter factory (for
example HTMLSCF) with a tokeniser (ST), the analysis chain grinds to a halt.
The char filter does not seem to pass anything on to the tokeniser.

Field type is:

<fieldType name="clean_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>

Output of the analysis screen is:

Field value (index)
Content with mark up <br /> should be cleaned

HTMLSCF > Content with mark up should be cleaned
ST > <BLANK>
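
For comparison, the ST row would normally be expected to list tokens rather than a blank. The sketch below is only a rough stand-in for the first two stages of the chain, using plain regular expressions instead of Solr's HTMLStripCharFilterFactory and StandardTokenizerFactory; the function names are made up for illustration, and the later filters (word delimiter, lowercase, stemming) are left out.

```python
import re


def strip_html(text):
    # Stand-in for the char filter stage: replace each tag with a space.
    # This is an approximation, not Solr's HTML stripping logic.
    return re.sub(r"<[^>]*>", " ", text)


def tokenize(text):
    # Stand-in for the tokeniser stage: runs of word characters become tokens.
    # The real StandardTokenizer follows Unicode segmentation rules.
    return re.findall(r"\w+", text)


value = "Content with mark up <br /> should be cleaned"
stripped = strip_html(value)   # what the HTMLSCF row shows
tokens = tokenize(stripped)    # what the ST row would be expected to show
print(tokens)
# ['Content', 'with', 'mark', 'up', 'should', 'be', 'cleaned']
```

The point is only that the tokeniser receives the stripped text and should emit one token per word, so a blank ST row is not what this configuration should produce.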

I know I must be missing something obvious!

Cheers Lee C
...
Re: char filter factory and tokeniser issue in admin Analysis form

lee carroll
B*ll*cks, before posting I spent an hour searching for issues, honest.
Soon as I post within seconds I find

https://issues.apache.org/jira/browse/SOLR-5800
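
If the blank ST row turns out to be a display problem in the admin form rather than in the analysis chain itself (which is how the linked issue appears to describe it), one way to cross-check is to ask Solr's field analysis request handler directly and read the raw response. A minimal sketch follows, assuming the handler is registered at /analysis/field in solrconfig.xml; the host, port and core name (collection1) are placeholders, and the helper name is invented for illustration.

```python
from urllib.parse import urlencode


def field_analysis_url(base_url, core, field_type, value):
    # Build a request URL for the field analysis handler.
    # analysis.fieldtype and analysis.fieldvalue are the handler's parameters;
    # base_url and core are placeholders for your own installation.
    params = {
        "analysis.fieldtype": field_type,
        "analysis.fieldvalue": value,
        "wt": "json",
    }
    return f"{base_url}/{core}/analysis/field?{urlencode(params)}"


url = field_analysis_url(
    "http://localhost:8983/solr",  # assumed host and port
    "collection1",                 # assumed core name
    "clean_text",
    "Content with mark up <br /> should be cleaned",
)
print(url)
```

Opening the printed URL in a browser or with curl should return the output of each stage of the chain, independent of how the admin form renders it.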



Re: char filter factory and tokeniser issue in admin Analysis form

Alexandre Rafalovitch
On 20 October 2015 at 10:26, Lee Carroll <[hidden email]> wrote:
> B*ll*cks, before posting I spent an hour searching for issues, honest.
> Soon as I post within seconds I find
>
> https://issues.apache.org/jira/browse/SOLR-5800

We are always glad to be of help. Including by RubberDucking:
http://c2.com/cgi/wiki?RubberDucking

Now remember the question that you asked yourself for that insight and
remember to ask it next time. I suspect it was "4.7? I wonder if it is a
version-specific issue, since solved". I classify this under
"Magnitude" in my presentation at Solr Revolution this past week:
http://www.slideshare.net/arafalov/solr-troubleshooting-treemap-approach
(slide 10).

Regards,
   Alex.


----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/
Re: char filter factory and tokeniser issue in admin Analysis form

lee carroll
No Alexandre, it's just Sod's law (http://www.thefreedictionary.com/Sod's+Law)
:-)

Lee C

