Solr Analyzer : Filter to drop tokens based on some logic which needs access to adjacent tokens


pratik@semandex
Hello Everyone,

Let's say I have an analyzer that produces the following token stream as output.

*token stream : [], a, ab, [], c, [], d, de, def .....*

Now let's say I want to add another filter that drops certain tokens
based on whether the adjacent token on the right is [] or a non-empty string.

For a given token:
     drop it (or replace it with an empty string) if the token on its right
is a non-empty string, and
     keep it if the token on its right is empty.

Based on this, the resulting token stream would look like this.

*desired output stream : [], a<dropped>, ab, [], c, [], d<dropped>,
de<dropped>, def*


*Is there any filter available in Solr with which this can be achieved?*
*If writing a custom filter is the only option, then I want to know
whether it's possible to access adjacent tokens in a custom filter.*

*Any ideas about this would be really helpful.*

Thanks,
Pratik

Re: Solr Analyzer : Filter to drop tokens based on some logic which needs access to adjacent tokens

Emir Arnautović
Hi Pratik,
You might be able to do some of the required things using PatternReplaceCharFilter, but note that it operates on the input string rather than on individual tokens. Your best bet is a custom token filter. I'm not sure how familiar you are with how token filters work, but you have access to the tokens from the previous filter and can implement any logic you want: you can consume several tokens (say, three) and emit tokens based on their neighbours.
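
The look-ahead idea can be modelled in plain Java without any Lucene classes. The sketch below is only an illustration under stated assumptions: the class name `LookaheadDropFilter` and its `filter` method are hypothetical, tokens are plain strings, `[]` is represented by the empty string, empty tokens always pass through, and the last token is kept because nothing follows it. In an actual Lucene `TokenFilter` the same buffering would presumably live inside `incrementToken()`, with `captureState()` and `restoreState()` used to hold the pending token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical plain-Java model of a one-token look-ahead filter (not Lucene API).
public class LookaheadDropFilter {

    // Streams over the input, holding one token back until its right neighbour is known.
    static List<String> filter(Iterator<String> input) {
        List<String> output = new ArrayList<>();
        if (!input.hasNext()) {
            return output;
        }
        String pending = input.next();
        while (input.hasNext()) {
            String next = input.next();
            // Empty tokens are always kept; a non-empty token is kept
            // only when the token on its right is empty.
            if (pending.isEmpty() || next.isEmpty()) {
                output.add(pending);
            }
            pending = next;
        }
        // Assumption: the final token has no right neighbour, so it is kept.
        output.add(pending);
        return output;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("", "a", "ab", "", "c", "", "d", "de", "def");
        System.out.println(filter(tokens.iterator())); // prints [, ab, , c, , def]
    }
}
```

The filter never needs more than one buffered token, because the decision for each token depends only on its immediate right neighbour.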

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 7 Feb 2020, at 19:27, Pratik Patel <[hidden email]> wrote:


Re: Solr Analyzer : Filter to drop tokens based on some logic which needs access to adjacent tokens

pratik@semandex
Thanks for the reply, Emir.

I will be exploring the option of creating a custom filter. It's good to
know that we can consume more than one token from the previous filter and emit
a different number of tokens. Do you know of any existing filter in Solr
that does something similar? It would be very helpful to see how more
than one token can be consumed. I can implement my custom logic once I
have access to multiple tokens from the previous filter.

Thanks
Pratik

On Mon, Feb 10, 2020 at 2:47 AM Emir Arnautović <
[hidden email]> wrote:


Re: Solr Analyzer : Filter to drop tokens based on some logic which needs access to adjacent tokens

Emir Arnautović
Hi Pratik,
The shingle filter (ShingleFilter) should do that: it is an existing filter that consumes more than one token at a time.
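
To see why a shingle filter exposes adjacent tokens, here is a rough plain-Java model of two-token shingles. It is not Lucene's `ShingleFilter` implementation: the class name `BigramShingles` and its `shingles` method are hypothetical, and the sketch emits only the pairs (joined by a space) while ignoring unigrams, positions and offsets.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical plain-Java model of two-token shingles (not Lucene API).
public class BigramShingles {

    // Pairs every token with its right neighbour.
    static List<String> shingles(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            out.add(tokens.get(i) + " " + tokens.get(i + 1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(shingles(Arrays.asList("d", "de", "def"))); // prints [d de, de def]
    }
}
```

Each emitted pair carries a token together with its right neighbour, which is the information the drop rule needs.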

Emir



> On 10 Feb 2020, at 18:57, Pratik Patel <[hidden email]> wrote: