Ref Guide - Precision & Recall of Analyzers


Paras Lehana
Hi Community,

In the *Understanding Analyzers, Tokenizers, and Filters*
<https://lucene.apache.org/solr/guide/8_3/understanding-analyzers-tokenizers-and-filters.html>
section of Ref Guide 8.3, the text describes how precision and recall depend
on the analyzers you use at index time and at query time:

> For indexing, you often want to simplify, or normalize, words. For example,
> setting all letters to lowercase, eliminating punctuation and accents,
> mapping words to their stems, and so on. Doing so can *increase recall* because,
> for example, "ram", "Ram" and "RAM" would all match a query for "ram". To *increase
> query-time precision*, a filter could be employed to narrow the matches
> by, for example, *ignoring all-cap acronyms* if you’re interested in male
> sheep, but not Random Access Memory.


In the first case (recall), is it assumed that "ram" should match all
three? *[Q1]* I ask because, to increase recall, we have to decrease false
negatives (relevant documents that are not retrieved). Otherwise (if the
three are not all intended to match the query), precision actually decreases
here, because false positives increase.

The second case makes sense: precision should increase because we are
decreasing false positives (documents wrongly retrieved as relevant).

However, the text talks about "employing a filter that ignores all-cap
acronyms". How are we supposed to do that at query time? *[Q2]* Wouldn't we
instead have to remove the lowercase filter (LCF) at index time?

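To make the two readings of Q1 concrete, here is a small worked sketch. The three one-word documents and the relevance judgments are made up for illustration (they are not from the Ref Guide), and `search` is a plain exact-match lookup rather than anything Solr does internally. It only shows how lowercasing at index time moves precision and recall for the query "ram" under each assumption.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two sets of document ids."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical one-token documents, keyed by id.
docs = {1: "ram", 2: "Ram", 3: "RAM"}


def search(query, lowercase_at_index_time):
    """Exact term match, optionally lowercasing the indexed tokens."""
    return {
        doc_id
        for doc_id, token in docs.items()
        if (token.lower() if lowercase_at_index_time else token) == query
    }


exact = search("ram", lowercase_at_index_time=False)       # only doc 1
normalized = search("ram", lowercase_at_index_time=True)   # docs 1, 2 and 3

# Assumption A: all three documents are relevant to the query "ram".
all_relevant = {1, 2, 3}
# Assumption B: only the sheep documents (ids 1 and 2) are relevant.
sheep_only = {1, 2}

results = {
    "A, exact": precision_recall(exact, all_relevant),
    "A, lowercased": precision_recall(normalized, all_relevant),
    "B, exact": precision_recall(exact, sheep_only),
    "B, lowercased": precision_recall(normalized, sheep_only),
}
```

Under assumption A, lowercasing raises recall from 1/3 to 1 without hurting precision. Under assumption B, recall rises from 1/2 to 1 but precision drops from 1 to 2/3, which is the trade-off Q1 is asking about.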

--
--
Regards,

*Paras Lehana* [65871]
Development Engineer, Auto-Suggest,
IndiaMART Intermesh Ltd.

8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
Noida, UP, IN - 201303

Mob.: +91-9560911996
Work: 01203916600 | Extn:  *8173*

--
IMPORTANT: 
NEVER share your IndiaMART OTP/ Password with anyone.

Re: Ref Guide - Precision & Recall of Analyzers

Audrey Lorberfeld - Audrey.Lorberfeld@ibm.com
I would also love to know what filter to use to ignore capitalized acronyms... which one can do this OOTB?

--
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
[hidden email]
 


Re: Ref Guide - Precision & Recall of Analyzers

Mikhail Khludnev-2
Hello, Audrey.

Can you create a regexp that captures all-caps tokens for the Pattern Replace Filter?
https://lucene.apache.org/solr/guide/8_3/filter-descriptions.html#pattern-replace-filter
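As a rough illustration of such a regexp, here is a sketch in Python. Solr itself uses Java regular expressions, but this simple pattern is written the same way in both. The minimum length of two letters and the helper name `is_all_caps` are assumptions made for the example, not something the Ref Guide prescribes.

```python
import re

# Two or more ASCII capitals and nothing else. The minimum length of 2 is an
# assumption made here so that single letters such as "A" or "I" are skipped.
ALL_CAPS = re.compile(r"^[A-Z]{2,}$")


def is_all_caps(token):
    """True if the whole token consists of at least two ASCII capitals."""
    return ALL_CAPS.fullmatch(token) is not None


tokens = ["ram", "Ram", "RAM", "A", "R.A.M.", ""]
flags = {token: is_all_caps(token) for token in tokens}
```

Only "RAM" is flagged here. Dotted forms such as "R.A.M." and non-ASCII capitals would need a broader pattern.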


--
Sincerely yours
Mikhail Khludnev

Re: Ref Guide - Precision & Recall of Analyzers

Paras Lehana
Hey Mikhail,

My question was about doing this on the query side. I think the text
probably meant adding the filter on the index side, then.

If we do this on the index side, as you suggested, we can capture
all-caps tokens with a regex like ^[A-Z]*$. But how do we proceed from
there? Here is what I can think of:

   1. Capture all-caps tokens in a copyField at index time and replace them
   with a marker, e.g. RAM -> <ACRONYM>RAM<ACRONYM>. Keep the copyField's
   query analysis the same and query both fields. In the original field,
   remove the all-caps token so that it doesn't match any lowercase token.
   2. Mark all-caps tokens as KEYWORD, if there is a way to do that, so that
   a LowerCase filter after it leaves them untouched. Use KeywordRepeat to
   keep the lowercase token as well.
   3. Use PatternReplace to turn all-caps tokens into dotted acronyms
   (RAM -> R.A.M.) and use something like TypeAsPayload
   <https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#FilterDescriptions-TypeAsPayloadFilter>
   to mark the token as <ACRONYM>.
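The intended effect of the second idea can be sketched in plain Python. This is only a simulation of the behaviour being asked for, not Solr or Lucene code: `analyze`, `matches`, the `keep_lowercase_copy` flag, and the two sample documents are all hypothetical, and whether an existing filter chain can actually be configured to behave this way is exactly the open question.

```python
import re

# All-caps token: two or more ASCII capitals (the length limit is an assumption).
ALL_CAPS = re.compile(r"[A-Z]{2,}")


def analyze(text, keep_lowercase_copy=False):
    """Whitespace-tokenize, lowercasing everything except all-caps tokens.

    With keep_lowercase_copy=True an all-caps token is emitted twice, once
    unchanged and once lowercased, loosely mimicking the KeywordRepeat idea.
    """
    terms = []
    for token in text.split():
        if ALL_CAPS.fullmatch(token):
            terms.append(token)
            if keep_lowercase_copy:
                terms.append(token.lower())
        else:
            terms.append(token.lower())
    return terms


def matches(query_term, document_text, **options):
    """Exact term lookup against the analyzed document."""
    return query_term in analyze(document_text, **options)


sheep_doc = "The Ram grazed"
memory_doc = "8GB of RAM"

strict = {
    "ram/sheep": matches("ram", sheep_doc),
    "ram/memory": matches("ram", memory_doc),
    "RAM/memory": matches("RAM", memory_doc),
}
relaxed_ram_memory = matches("ram", memory_doc, keep_lowercase_copy=True)
```

In the strict variant a query for "ram" finds the sheep document but not the memory one, while "RAM" still finds the memory document, provided the query term itself is not lowercased. Keeping the lowercase copy restores the original recall, so the two variants trade precision against recall in the way discussed earlier in the thread.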

I'm still curious about the proper way to do this, because all of my
suggestions are only workarounds, even if they work.

*If there's no simpler way, can we open a JIRA request for a filter that
marks acronyms as KEYWORD so that further analysis doesn't touch them or,
even better, an argument on LowerCase (like excludeAllCaps) that skips
lowercasing of all-caps tokens?*

