Solr Reference Guide issue for simplified tokenizers

Nikolay Khitrin
I think I have found an issue in the Solr Reference Guide entry for the
Simplified Regular Expression Pattern [Splitting ]Tokenizer
(https://lucene.apache.org/solr/guide/7_3/tokenizers.html#simplified-regular-expression-pattern-splitting-tokenizer).

The given example is

<analyzer>
  <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ \t\r\n]+"/>
</analyzer>


but Lucene's RegExp constructor expects raw Unicode characters rather than
the \t\r\n escape form, so the correct configuration is

<tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ &#x9;&#xA;&#xD;]+"/>
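
To make the difference concrete, here is a minimal Python sketch of what an XML parser hands over for each form of the attribute. It uses Python's standard xml.etree module purely as a stand-in for a conforming XML parser (Solr itself uses a Java parser), and the helper name pattern_attribute is made up for illustration. It only shows the XML layer; what Lucene's RegExp then does with the string is as described in this thread.

```python
import xml.etree.ElementTree as ET


def pattern_attribute(xml_snippet):
    """Return the 'pattern' attribute as delivered by the XML parser."""
    return ET.fromstring(xml_snippet).get("pattern")


# Backslash sequences are not XML escapes, so each one arrives
# as two separate characters: a backslash and a letter.
escaped = pattern_attribute(
    r'<tokenizer class="solr.SimplePatternSplitTokenizerFactory" '
    r'pattern="[ \t\r\n]+"/>'
)

# Character references are decoded by the XML parser into the
# actual tab, line feed, and carriage return characters.
referenced = pattern_attribute(
    '<tokenizer class="solr.SimplePatternSplitTokenizerFactory" '
    'pattern="[ &#x9;&#xA;&#xD;]+"/>'
)

print(repr(escaped))
print(repr(referenced))
```

The first value still contains backslashes, while the second contains the control characters themselves, which is the form the corrected configuration relies on.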

--
Nikolay Khitrin

Re: Solr Reference Guide issue for simplified tokenizers

Shawn Heisey-2
On 4/15/2018 5:42 AM, Nikolay Khitrin wrote:
> Given example is <analyzer> <tokenizer
> class="solr.SimplePatternSplitTokenizerFactory" pattern="[ \t\r\n]+"/>
> </analyzer> but Lucene's RegExp constructor consumes raw
> unicode characters instead of \t\r\n form, so correct configuration is
> <tokenizer class="solr.SimplePatternSplitTokenizerFactory"
> pattern="[ &#x9;&#xA;&#xD;]+"/>

Looks like you're right about that example not working.  I also tried it
with double backslashes -- something that would be required if the
string were found in actual Java code.  Your suggested replacement DOES
work -- the characters are encoded with XML syntax and passed as
ASCII/Unicode characters to the constructor for the tokenizer.

I cannot make any sense out of the Lucene RegExp javadoc.  I think it
needs some full string examples to illustrate what it is trying to say.

I don't think this is a good example for this particular tokenizer, even
if it's changed to your replacement that does work.  For what the
example is TRYING to do, WhitespaceTokenizerFactory is a better choice. 
It will match more whitespace characters than spaces, tabs, and newlines.

Here's an example using SimplePatternSplitTokenizerFactory that splits on
semicolons and trims leading/trailing whitespace from each token:

<analyzer>
   <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern=";"/>
   <filter class="solr.TrimFilterFactory"/>
</analyzer>
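
As a rough illustration of the intended result of that analyzer chain, the following Python sketch emulates it outside Solr: split on the semicolon, then trim each piece. The function name split_and_trim is hypothetical, and this is not Solr code; details such as how Solr treats empty tokens are not modeled here.

```python
def split_and_trim(text, delimiter=";"):
    """Emulate 'split on delimiter, then trim each token' in plain Python."""
    # Splitting corresponds to the tokenizer step; strip() corresponds
    # to trimming leading/trailing whitespace from each token.
    return [piece.strip() for piece in text.split(delimiter)]


tokens = split_and_trim("  red ; green;blue  ")
print(tokens)
```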

Thanks,
Shawn


Re: Solr Reference Guide issue for simplified tokenizers

Nikolay Khitrin
Yes, the Lucene RegExp javadoc seems a bit complicated, and even the tests
do not cover all syntax variants. But the whole point is that the parser
doesn't translate any characters: backslashes are used only to distinguish
syntax symbols from raw characters.
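
A small Python sketch of that rule, under the assumption stated in this thread that a backslash simply makes the next character literal. The function char_class_members is hypothetical and only models a flat character-class body (no ranges or negation); it is not Lucene's parser.

```python
def char_class_members(body):
    """Return the literal characters named by a simple character-class body.

    Assumption from this thread: a backslash only marks the following
    character as literal; it never turns 't' into a tab, and so on.
    """
    members = set()
    i = 0
    while i < len(body):
        if body[i] == "\\" and i + 1 < len(body):
            members.add(body[i + 1])  # escaped character taken literally
            i += 2
        else:
            members.add(body[i])      # raw character taken as-is
            i += 1
    return members


# Pattern text as written in the guide: backslash followed by letters.
from_guide = char_class_members(r" \t\r\n")
# Pattern text containing the real control characters.
from_raw = char_class_members(" \t\r\n")

print(sorted(from_guide))
print(sorted(from_raw))
```

Under this reading, the guide's pattern would split on spaces and on the letters t, r, and n, while the raw-character version splits on actual whitespace.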

The example might not be the best possible one, but I think the reference
guide should be corrected (maybe with an additional note about character
escaping), because it is difficult for end users who are not familiar with
the Lucene codebase to find the correct solution.


Unfortunately, fine-grained tokenizing control is sometimes the only
workaround for weird issues like LUCENE-7766.
For example, as a temporary solution I have to strip quotes at the
tokenizer stage to obtain WDGF offsets on parts (for strings like
&quot;Foo-Bar&quot; with HTMLStripCharFilter before the tokenizer).


2018-04-15 21:08 GMT+03:00 Shawn Heisey <[hidden email]>:

> On 4/15/2018 5:42 AM, Nikolay Khitrin wrote:
>
>> Given example is <analyzer> <tokenizer class="solr.SimplePatternSplitTokenizerFactory"
>> pattern="[ \t\r\n]+"/></analyzer> but Lucene's RegExp constructor consumes
>> raw unicode characters instead of \t\r\n form, so correct configuration is
>> <tokenizer class="solr.SimplePatternSplitTokenizerFactory"
>> pattern="[ &#x9;&#xA;&#xD;]+"/>
>>
>
> Looks like you're right about that example not working.  I also tried it
> with double backslashes -- something that would be required if the string
> were found in actual java code.  Your suggested replacement DOES work --
> the characters are encoded with XML syntax and passed as ascii/unicode to
> the constructor for the tokenizer.
>
> I cannot make any sense out of the Lucene RegExp javadoc.  I think it
> needs some full string examples to illustrate what it is trying to say.
>
> I don't think this is a good example for this particular tokenizer, even
> if it's changed to your replacement that does work.  For what the example
> is TRYING to do, WhitespaceTokenizerFactory is a better choice.  It will
> match more whitespace characters than spaces, tabs, and newlines.
>
> Here's an example using that tokenizer that will split on semicolon and
> eliminate leading/trailing whitespace from each token:
>
> <analyzer>
>   <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern=";"/>
>   <filter class="solr.TrimFilterFactory"/>
> </analyzer>
>
> Thanks,
> Shawn
>
>


--
Николай Хитрин