ICUFoldingFilter

Michael Sokolov-4
Hi, I'm using ICUFoldingFilter and for the most part it does exactly what I
want. However there are some behaviors I'd like to tweak. For example it
maps "aaa^bbb" to "aaabbb". I am trying to understand why it does that, and
whether there is any way to prevent it.

I spent a little time with
http://www.unicode.org/reports/tr30/tr30-4.html#UnicodeData which I guess
is the basis for what this filter does (it's referenced in the javadocs),
but that didn't answer my questions. As an aside, it seems this tech report
was withdrawn by the Unicode Consortium? Not sure what that means, if
anything, but it seems ominous.

Anyway, I would appreciate pointers to more info, and specifically, whether
there are any alternatives to the utr30.nrm data file, or any possibility
to select among the many transformations this filter applies.

Thanks!

Mike S

Re: ICUFoldingFilter

Robert Muir
This cannot be "tweaked" at runtime; it is implemented as a custom normalization.

You can modify the sources and build your own ruleset, or use a different
token filter to normalize characters.

On Mon, Jun 4, 2018 at 9:07 AM, Michael Sokolov <[hidden email]> wrote:

> Hi, I'm using ICUFoldingFilter and for the most part it does exactly what I
> want. However there are some behaviors I'd like to tweak. For example it
> maps "aaa^bbb" to "aaabbb". I am trying to understand why it does that, and
> whether there is any way to prevent it.

Re: ICUFoldingFilter

Robert Muir
Actually, you can now choose to ignore certain characters by using the
Unicode filtering mechanism.

This was added in https://issues.apache.org/jira/browse/LUCENE-8129

So apply a Unicode set filter such as [^\^] and the folding filter will
leave ^ untouched.
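
A minimal JDK-only sketch of what such a set filter means: folding is applied only to characters inside the set, and everything outside it passes through unchanged. This is not the ICU or Lucene code. The names `toyFold` and `filteredFold` are hypothetical, the folding step is a crude stand-in for the utr30 rules, and handling one code point at a time is a simplification.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.function.IntPredicate;

/** Toy illustration only: NOT the ICU/Lucene implementation. */
class FilteredFoldingSketch {

    /** Hypothetical stand-in for folding: strips accents, drops '^', lowercases. */
    static String toyFold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFKD);
        return decomposed
                .replaceAll("\\p{M}+", "")   // remove combining marks left by decomposition
                .replace("^", "")            // mimic the caret removal reported in this thread
                .toLowerCase(Locale.ROOT);
    }

    /** Folds only code points accepted by the filter; all others are copied as-is. */
    static String filteredFold(String s, IntPredicate inFilter) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            String ch = new String(Character.toChars(cp));
            out.append(inFilter.test(cp) ? toyFold(ch) : ch);
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // Same meaning as the set pattern [^\^]: every code point except '^'.
        IntPredicate allButCaret = cp -> cp != '^';
        System.out.println(toyFold("R\u00e9sum\u00e9^Bbb"));                   // resumebbb
        System.out.println(filteredFold("R\u00e9sum\u00e9^Bbb", allButCaret)); // resume^bbb
    }
}
```

The negated set reads as "fold everything except ^", which is why the caret survives in the second call.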

On Mon, Jun 4, 2018 at 10:41 AM, Robert Muir <[hidden email]> wrote:

> This cannot be "tweaked" at runtime; it is implemented as a custom normalization.
>
> You can modify the sources and build your own ruleset, or use a different
> token filter to normalize characters.

Re: ICUFoldingFilter

Michael Sokolov-4
Ah thanks! That's very good to know. As it happens, I realized we already have an
earlier component where we can handle this (we have a custom ICUTokenizer
RBBI ruleset and can just split on "^"). So many flexibility

-Mike

On Mon, Jun 4, 2018 at 10:53 AM, Robert Muir <[hidden email]> wrote:

> Actually, you can now choose to ignore certain characters by using the
> Unicode filtering mechanism.
>
> This was added in https://issues.apache.org/jira/browse/LUCENE-8129
>
> So apply a Unicode set filter such as [^\^] and the folding filter will
> leave ^ untouched.

Re: ICUFoldingFilter

Robert Muir
There may be traps, e.g. if you make such a filter with UnicodeSet,
I think you really need to call .freeze() before passing it to the
filter. I have not examined the sources in a while, but I think this
might be similar to "compiling a regexp" in that you'll then get good
performance when it's later called millions of times.

If you use the factories, it will do this for you. But if you use the
API directly it is currently a bit of a performance trap...
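
The "compiling a regexp" analogy can be sketched with the JDK alone. This shows only the build-once-and-reuse pattern using `java.util.regex`; it does not use UnicodeSet, and whether freezing behaves the same way internally is an assumption taken from the analogy above. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

/** Analogy only: precompile once and reuse, instead of rebuilding on every call. */
class CompileOnceSketch {

    // Built a single time and shared by every call to matchesPrecompiled.
    private static final Pattern NOT_CARET = Pattern.compile("[^\\^]+");

    /** Slow path: String.matches compiles the pattern again on each call. */
    static boolean matchesRecompiling(String token) {
        return token.matches("[^\\^]+");
    }

    /** Fast path: reuses the pattern compiled above. */
    static boolean matchesPrecompiled(String token) {
        return NOT_CARET.matcher(token).matches();
    }

    public static void main(String[] args) {
        // Both paths give the same answer; only the per-call cost differs.
        System.out.println(matchesRecompiling("aaabbb") + " " + matchesPrecompiled("aaabbb"));   // true true
        System.out.println(matchesRecompiling("aaa^bbb") + " " + matchesPrecompiled("aaa^bbb")); // false false
    }
}
```

With a token filter the check runs once per token, so the setup cost of the slow path is paid millions of times over a large index.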

On Mon, Jun 4, 2018 at 2:49 PM, Michael Sokolov <[hidden email]> wrote:

> Ah thanks! That's very good to know. As it happens, I realized we already have an
> earlier component where we can handle this (we have a custom ICUTokenizer
> RBBI ruleset and can just split on "^"). So many flexibility
>
> -Mike

Re: ICUFoldingFilter

Michael Sokolov-4
That's good to know. If we go this route, we'll definitely either use the
factory or follow its example. Thanks again.

-Mike

On Mon, Jun 4, 2018 at 9:12 PM, Robert Muir <[hidden email]> wrote:

> There may be traps, e.g. if you make such a filter with UnicodeSet,
> I think you really need to call .freeze() before passing it to the
> filter. I have not examined the sources in a while, but I think this
> might be similar to "compiling a regexp" in that you'll then get good
> performance when it's later called millions of times.
>
> If you use the factories, it will do this for you. But if you use the
> API directly it is currently a bit of a performance trap...