Multi-lingual Search & Accent Marks

Multi-lingual Search & Accent Marks

Audrey Lorberfeld - Audrey.Lorberfeld@ibm.com
Hi All,

Just wanting to test the waters here – for those of you with search engines that index multiple languages, do you use ASCII-folding in your schema? We are onboarding Spanish documents into our index right now and keep going back and forth on whether we should preserve accent marks. From our query logs, it seems people generally do not include accents when searching, but you never know…

Thank you in advance for sharing your experiences!

--
Audrey Lorberfeld
Data Scientist, w3 Search
Digital Workplace Engineering
CIO, Finance and Operations
IBM
[hidden email]


Re: Multi-lingual Search & Accent Marks

Atita Arora
We work on a German index. We neutralize accents before indexing, i.e. umlauts become 'ae', 'ue', etc., and we apply the same normalization at query time so the terms match.
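
A rough sketch of how that can be wired up in a Solr schema, assuming a mapping file for MappingCharFilterFactory (the fieldType and file names here are only placeholders):

  # mapping-german-umlauts.txt
  "ä" => "ae"
  "ö" => "oe"
  "ü" => "ue"
  "ß" => "ss"
  # uppercase forms need entries too, e.g. "Ä" => "Ae"

  <fieldType name="text_de_folded" class="solr.TextField">
    <analyzer>
      <!-- the char filter runs before the tokenizer, so it sees the raw text -->
      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-german-umlauts.txt"/>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

A single <analyzer> with no type attribute applies to both indexing and querying, which is what gives the matching behaviour described above.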


Re: Re: Multi-lingual Search & Accent Marks

Audrey Lorberfeld - Audrey.Lorberfeld@ibm.com
Atita,

Thanks for that insight!

As the conversation has progressed, we are now leaning towards leaving the ASCII-folding filter out of our pipelines in order to keep marks like umlauts and tildes. Instead, we might list the acute and grave accents in a file pointed at by the MappingCharFilterFactory, so that only those more common accent marks get stripped...
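
To make that concrete, a rough sketch of the kind of mapping file we have in mind (the file name is just a placeholder), referenced from a char filter that sits ahead of the tokenizer:

  # accent-strip.txt, referenced via
  #   <charFilter class="solr.MappingCharFilterFactory" mapping="accent-strip.txt"/>
  "á" => "a"
  "é" => "e"
  "í" => "i"
  "ó" => "o"
  "ú" => "u"
  # grave-accented forms (à, è, ...) and uppercase forms would get the same treatment

Characters not listed in the file (ñ, ü, umlauts, etc.) pass through untouched, which is the point of preferring this over blanket ASCII folding.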

Any other opinions are welcome!

--
Audrey Lorberfeld
Data Scientist, w3 Search
Digital Workplace Engineering
CIO, Finance and Operations
IBM
[hidden email]
 



Re: Re: Multi-lingual Search & Accent Marks

Erick Erickson
It Depends (tm). In this case, on how sophisticated/precise your users are. If your users are all highly conversant in the language and can be expected to have keyboards that give easy access to all the accents… then I might leave them in. In some cases removing them can change the meaning of a word.

That said, most installations I’ve seen remove them. They’re still present in any returned stored field so the doc looks good. And then you bypass all the nonsense about perhaps ingesting a doc that “somehow” had accents removed and/or people not putting accents in their search and the like.

MappingCharFilterFactory works well for this.



Re: Multi-lingual Search & Accent Marks

Walter Underwood
The right transliteration for accents is language-dependent. In English, a diaeresis can be stripped because it is only used to mark neighboring vowels as independently pronounced. In German, the “typewriter umlaut” adds an “e”.

English: coöperate -> cooperate
German: Glück -> Glueck

Some stemmers will handle the typewriter umlauts for you. The InXight stemmers used to do that.

The English diaeresis is a fussy usage, but it does occur in text. For years, MS Word corrected “naive” to “naïve”. There may even be a curse associated with its usage.

https://www.newyorker.com/culture/culture-desk/the-curse-of-the-diaeresis

In German, there are corner cases where just stripping the umlaut changes one word into another, like schön/schon.

Isn’t language fun?

wunder
Walter Underwood
[hidden email]
http://observer.wunderwood.org/  (my blog)



Re: Multi-lingual Search & Accent Marks

Toke Eskildsen
In reply to this post by Audrey Lorberfeld - Audrey.Lorberfeld@ibm.com
Audrey Lorberfeld - [hidden email] <[hidden email]> wrote:
> Just wanting to test the waters here – for those of you with search engines
> that index multiple languages, do you use ASCII-folding in your schema?

Our primary search engine is for Danish users, with the sources being bibliographic records with titles and other metadata in many different languages. We normalise to Danish, meaning that most ligatures are removed, but also that letters such as the Swedish ö become the Danish ø. The rules for normalisation are dictated by Danish library practice and were implemented by a resident librarian.

Whenever we do this normalisation, we index two versions of the field: a very lightly normalised (lowercased) field and a heavily normalised one. If a record has the title "Köket" (kitchen in Swedish), we store title_orig:köket and title_norm:køket. edismax is used to ensure that both fields are searched by default (plus an explicit field alias "title" is set to point to both title_orig and title_norm for qualified searches) and that matches in title_orig carry more weight in the relevance calculation.

> We are onboarding Spanish documents into our index right now and keep
> going back and forth on whether we should preserve accent marks.

Going with what we do, my answer would be: Yes, do preserve and also remove :-). You could even have 3 or more levels of normalisation, depending on how much time you have for polishing.
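
In request-handler terms it is roughly the following (field names as in the example above, boosts purely illustrative):

  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="defType">edismax</str>
      <!-- both variants searched by default, the original form weighted higher -->
      <str name="qf">title_orig^2 title_norm</str>
      <!-- eDismax field alias: a qualified title:... query hits both fields -->
      <str name="f.title.qf">title_orig title_norm</str>
    </lst>
  </requestHandler>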

- Toke Eskildsen

Re: Multi-lingual Search & Accent Marks

Walter Underwood
> On Aug 31, 2019, at 12:00 PM, Toke Eskildsen <[hidden email]> wrote:
>
> Whenever we do this normalisation, we index two versions in our index: A very lightly normalised (lowercased) field and a heavily normalised field: If a record has a title "Köket" (kitchen in Swedish), we store title_orig:köket and title_norm:køket. […] Going with what we do, my answer would be: Yes, do preserve and also remove :-)


Right after I posted, I realized that I wanted to offer "include all" as an option. The variants can even be in the same field, like synonyms at the same token position.

Also, don’t worry too much about creating junk terms in the index with nonsense transliterations. Terms are cheap in search indexes (up to a point). So it really is OK to have all of these indexed at the same position, even if the last one is garbage. This still has the schön/schon problem, but at least there is a match.

coöperation
cooperation
cooeperation (typewriter-umlaut version)
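
As a sketch, the stock way to get the first two of those stacked at the same position is the preserveOriginal option on the ASCII folding filter; the typewriter-umlaut variant would still need a char mapping or a custom filter on top:

  <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>

With preserveOriginal the folded token is emitted alongside the original at the same position (position increment 0), which is exactly the synonym-style stacking described above.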

wunder
Walter Underwood
[hidden email]
http://observer.wunderwood.org/  (my blog)


Re: Re: Re: Multi-lingual Search & Accent Marks

Audrey Lorberfeld - Audrey.Lorberfeld@ibm.com
In reply to this post by Erick Erickson
Thank you, Erick!

--
Audrey Lorberfeld
Data Scientist, w3 Search
Digital Workplace Engineering
CIO, Finance and Operations
IBM
[hidden email]
 



Re: Re: Multi-lingual Search & Accent Marks

Audrey Lorberfeld - Audrey.Lorberfeld@ibm.com
In reply to this post by Walter Underwood
Languages are the best. Thank you all so much!

--
Audrey Lorberfeld
Data Scientist, w3 Search
Digital Workplace Engineering
CIO, Finance and Operations
IBM
[hidden email]
 



Re: Re: Multi-lingual Search & Accent Marks

Audrey Lorberfeld - Audrey.Lorberfeld@ibm.com
In reply to this post by Audrey Lorberfeld - Audrey.Lorberfeld@ibm.com
Toke,

Do you find that searching over both the original title field and the normalized title field increases the time it takes for your search engine to retrieve results?

--
Audrey Lorberfeld
Data Scientist, w3 Search
Digital Workplace Engineering
CIO, Finance and Operations
IBM
[hidden email]
 



Re: Re: Multi-lingual Search & Accent Marks

Toke Eskildsen
Audrey Lorberfeld - [hidden email] <[hidden email]> wrote:
> Do you find that searching over both the original title field and the normalized title
> field increases the time it takes for your search engine to retrieve results?

It is not something we have measured, as that index is fast enough (which in this context means that we're practically always waiting for the result from an external service that is called in parallel with the call to our Solr server).

Technically it's no different from searching across the other fields defined in the eDismax setup, so I guess it boils down to "how many fields can you afford to search across?", where our organization's default answer is "as many as we need to get quality matches. Make it work, Toke, chop chop". On a more serious note, it is not something I would worry about unless we're talking about a special high-performance setup with a budget for tuning: matching terms and joining filters is core Solr (really Lucene) functionality. Plain query and filter matching time tends to be dwarfed by aggregations (grouping, faceting, stats).

- Toke Eskildsen

Re: Re: Re: Multi-lingual Search & Accent Marks

Audrey Lorberfeld - Audrey.Lorberfeld@ibm.com
In reply to this post by Audrey Lorberfeld - Audrey.Lorberfeld@ibm.com
Toke,

Thank you! That makes a lot of sense.

In other news -- we just had a meeting where we decided to try out a hybrid strategy. I'd love to know what you & everyone else thinks...

- Since we are concerned about the overhead created by "double-fielding" all tokens per language (and I'm not sure how we'd work the logic into Solr to only double-field when an accent is present), we are going to try something along the lines of synonym expansion:
        - We are going to build a custom plugin that detects diacritics -- upon detection, the plugin would expand the token to both its original form and its ascii-folded term (a la Toke's approach).
        - However, since we are doing it in a way that mimics synonym expansion, we are going to keep both terms in a single field

The main issue we anticipate with the above strategy concerns scoring. Since we will be increasing the frequency of accented terms, we might bias our page ranker...

Has anyone done anything similar (and/or does anyone think this idea is totally the wrong way to go?)

Best,
Audrey

--
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
[hidden email]
 



Re: Re: Re: Multi-lingual Search & Accent Marks

Alexandre Rafalovitch
What about combining:
1) KeywordRepeatFilterFactory
2) An existing folding filter (need to check that it ignores Keyword-marked words)
3) RemoveDuplicatesTokenFilterFactory

That may give what you are after without custom coding.
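
Roughly, the chain would look like this (the fieldType name is just a placeholder, and the caveat in step 2 above still applies):

  <fieldType name="text_fold_both" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- 1) emit every token twice, one copy flagged with the Keyword attribute -->
      <filter class="solr.KeywordRepeatFilterFactory"/>
      <!-- 2) folding step; verify that the filter you pick skips Keyword-flagged
              tokens, otherwise both copies get folded -->
      <filter class="solr.ASCIIFoldingFilterFactory"/>
      <!-- 3) collapse the pair again wherever folding changed nothing -->
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldType>

If the folding filter turns out not to honour the Keyword flag, ASCIIFoldingFilterFactory's preserveOriginal="true" option gives a similar stacked-token result on its own, without the repeat/deduplicate pair.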

Regards,
   Alex.


Re: Re: Re: Re: Multi-lingual Search & Accent Marks

Audrey Lorberfeld - Audrey.Lorberfeld@ibm.com
Thanks, Alex! We'll look into this.

--
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
[hidden email]
 



Re: Re: Re: Multi-lingual Search & Accent Marks

Walter Underwood
In reply to this post by Audrey Lorberfeld - Audrey.Lorberfeld@ibm.com
On Sep 3, 2019, at 1:13 PM, Audrey Lorberfeld - [hidden email] <[hidden email]> wrote:
>
> The main issue we are anticipating with the above strategy surrounds scoring. Since we will be increasing the frequency of accented terms, we might bias our page ranker...

You will not be increasing the frequency of the accented terms. Those frequencies will stay the same. You’ll be adding new unaccented terms. The new terms will probably have higher frequencies than the accented terms. If so, the accented terms should be preferred for accented queries. You might or might not want that behavior.

doc1: glück
doc1 terms: glück, gluck, glueck

doc2: glueck
doc2 terms: glueck

df for glück: 1
df for gluck: 1
df for glueck: 2

The df for the term “glück” is the same whether you expand or not.

wunder
Walter Underwood
[hidden email]
http://observer.wunderwood.org/  (my blog)