AlphaNumeric analyzer/tokenizer

Abhishek Chauhan
Hi,

We have been using SimpleAnalyzer, which keeps only letters in its tokens.
This prevents us from searching strings that contain both letters and
numbers, e.g. "axt1234". SimpleAnalyzer only lets us search for "axt"
successfully; search strings like "axt1", "axt123", etc. give no
results because the numbers were dropped at index time.

I can use StandardAnalyzer or WhitespaceAnalyzer, but I also want to
tokenize on underscores, which these analyzers don't do. I have also
looked at WordDelimiterFilter, which will split "axt1234" into "axt" and
"1234". However, even with this I cannot search for "axt12" etc.

Is there something like an alphanumeric analyzer that would be very
similar to SimpleAnalyzer but would keep digits as well as letters
in its tokens? I am willing to contribute such an analyzer if one is
not available.

Thanks and Regards,
Abhishek

RE: AlphaNumeric analyzer/tokenizer

Uwe Schindler
Hi,

The easiest way is to use PatternTokenizer as part of your analyzer. It uses a regular expression to split text into tokens. Just use a regular expression that matches the Unicode ranges for letters and digits.

To build your analyzer, use the CustomAnalyzer class and its builder API to construct your own analysis chain. Use PatternTokenizerFactory as the tokenizer, add filters such as LowerCaseFilterFactory, and you are done. No need for any new components in Lucene. It's all there, RTFM 😊

https://lucene.apache.org/core/8_2_0/analyzers-common/org/apache/lucene/analysis/custom/CustomAnalyzer.html
https://lucene.apache.org/core/8_2_0/analyzers-common/org/apache/lucene/analysis/pattern/PatternTokenizerFactory.html (the example there is for Apache Solr, but you can use the same parameter names in CustomAnalyzer)
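
As a rough illustration of the kind of regular expression meant here, the following plain-Java sketch extracts runs of Unicode letters and decimal digits and lowercases them. It does not use Lucene at all; the class name AlnumTokenDemo and the pattern [\p{L}\p{Nd}]+ are assumptions made for this example. With PatternTokenizerFactory, a pattern of this kind would presumably be passed as the "pattern" parameter with "group" set to 0, so that the matches become the tokens instead of the separators.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: mimics "keep letters and digits, split on everything else".
final class AlnumTokenDemo {

    // Assumed pattern: one or more Unicode letters (\p{L}) or decimal digits (\p{Nd}).
    // The underscore is neither, so it acts as a separator.
    static final Pattern ALNUM = Pattern.compile("[\\p{L}\\p{Nd}]+");

    // Returns every match of ALNUM in the text, lowercased.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = ALNUM.matcher(text);
        while (m.find()) {
            tokens.add(m.group().toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [axt1234, foo, bar, 42]
        System.out.println(tokenize("AXT1234_foo bar-42"));
    }
}
```

Note that this only keeps "axt1234" together as a single token. A search for "axt12" would still have to be issued as a prefix query (for example axt12*) to match it.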

Uwe

-----
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: [hidden email]


Re: AlphaNumeric analyzer/tokenizer

Abhishek Chauhan
Hi,

Can someone please check the above mail and provide some feedback?

Thanks and Regards,
Abhishek


Re: AlphaNumeric analyzer/tokenizer

Uwe Schindler
You already got many responses. Check your inbox.

Uwe


Re: AlphaNumeric analyzer/tokenizer

Martin Grigorov
Hi,


On Mon, Aug 19, 2019 at 9:31 AM Uwe Schindler <[hidden email]> wrote:

> You already got many responses. Check your inbox.
>

"many" made me think that I've also missed something.
https://markmail.org/message/ohv5qcvxilj3n3fb

