Skewed IDF in multi lingual index, again


Skewed IDF in multi lingual index, again

Markus Jelsma-2
Hello,

We already discussed this problem five years ago [1]. In short: documents in foreign languages are scored higher for some terms.

It was solved back then by using docCount instead of maxDoc when calculating idf, and it worked really well! But, probably due to index changes, the problem is back for some terms, mostly proper nouns, just like five years ago.

We already deboost documents that are not in the user's preferred language by 0.7, but in some cases that is not enough. I could keep reducing that boost, but that's not what I prefer.

I'd like to know if there are additional tricks to solve the problem.

Many thanks!
Markus

[1] http://lucene.472066.n3.nabble.com/Skewed-IDF-in-multi-lingual-index-td4019095.html

Re: Skewed IDF in multi lingual index, again

Walter Underwood
I’ve occasionally considered using Unicode language tags (U+E0001 and friends) on each term. That would make a term specific to a language, so we would get [en]LaserJet, [fr]LaserJet, [de]LaserJet, and so on. But that is a pretty big hammer, because it restricts matches to the same language. If the entire document is in one language, you might as well use a filter query for that language. The tags would work for multiple languages in one document.

Maybe make the untagged term a synonym. For cross-language terms like “LaserJet”, the untagged one would have worse idf.
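
A rough sketch of the tagging idea, for illustration only. It uses the visible [en] bracket notation from the paragraph above rather than real Unicode tag characters, and the class and method names (LanguageTaggedTerms, tag, expand) are hypothetical, not part of Lucene or Solr. In a real index this would presumably live in the analysis chain.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not a Lucene or Solr API. */
final class LanguageTaggedTerms {

    /** Returns "[lang]term", mirroring the [en]LaserJet notation. */
    static String tag(String lang, String term) {
        return "[" + lang + "]" + term;
    }

    /**
     * Emits each term twice: once tagged with the language, once untagged.
     * The tagged form gets per-language statistics; the untagged form is
     * shared by all languages and acts as the cross-language synonym.
     */
    static List<String> expand(String lang, List<String> terms) {
        List<String> out = new ArrayList<>();
        for (String term : terms) {
            out.add(tag(lang, term));
            out.add(term);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(expand("en", Arrays.asList("LaserJet", "printer")));
        System.out.println(expand("fr", Arrays.asList("LaserJet", "imprimante")));
    }
}
```

With both forms indexed, a query could match on either one, which is what keeps cross-language matches possible.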

wunder
Walter Underwood
http://observer.wunderwood.org/  (my blog)


> On Nov 30, 2017, at 8:14 AM, Markus Jelsma <[hidden email]> wrote:
>
> I'd like to know if there are additional tricks to solve the problem.

RE: Skewed IDF in multi lingual index, again

Markus Jelsma-2
In reply to this post by Markus Jelsma-2
This is unfortunately not what we want. Some customers use filters to restrict language, but some customers don't. They want to be able to find documents in all languages, so we use the user's preference to rank documents in their own language on top, except for very relevant documents in foreign languages; hence the deboost is not too low.

Thanks,
Markus

 
-----Original message-----

> From:Walter Underwood <[hidden email]>
> Sent: Thursday 30th November 2017 17:29
> To: [hidden email]
> Subject: Re: Skewed IDF in multi lingual index, again
>
> If the entire document is in one language, might as well use a filter query for that language.

Re: Skewed IDF in multi lingual index, again

Walter Underwood
Expanding the query to use both the tagged and untagged term might work. I’m not sure the effect would be a lot different than boosting the preferred language.

wunder


> On Nov 30, 2017, at 8:35 AM, Markus Jelsma <[hidden email]> wrote:
>
> They want to be able to find documents in all languages, so we use user preference to get their local language on top.

Re: Skewed IDF in multi lingual index, again

alessandro.benedetti
In reply to this post by Markus Jelsma-2
Hi Markus,
just out of interest, why did using docCount instead of maxDoc when
calculating idf solve the problem?

I assume you are using different fields, one per language.
Each field appears in a different number of docs, I guess, e.g.
text_en -> 10000 docs
text_fr -> 1000 docs
text_it -> 500 docs

Is the reason docCount improved things that it is relative to a specific
field, while maxDoc is global across the whole index?
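
To make the question concrete, here is a small sketch under stated assumptions. It uses the BM25 idf formula log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) and the field sizes from the example above. The document frequencies of the proper noun (1000 in text_en, 5 in text_it) are invented, and it assumes each document has exactly one language field, so maxDoc is the sum of the three.

```java
/** Illustrative only: the document frequencies are made up. */
final class IdfSkewDemo {

    /** BM25 idf: log(1 + (n - docFreq + 0.5) / (docFreq + 0.5)). */
    static double bm25Idf(long docFreq, long n) {
        return Math.log(1 + (n - docFreq + 0.5) / (docFreq + 0.5));
    }

    public static void main(String[] args) {
        long enDocCount = 10_000;
        long frDocCount = 1_000;
        long itDocCount = 500;
        // Assumption: one language field per document, no deletions.
        long maxDoc = enDocCount + frDocCount + itDocCount;

        // Hypothetical frequencies of the same proper noun per field.
        long enDocFreq = 1_000;
        long itDocFreq = 5;

        // Index-wide denominator: the small field looks extremely rare.
        System.out.printf("maxDoc:   en=%.2f it=%.2f%n",
                bm25Idf(enDocFreq, maxDoc), bm25Idf(itDocFreq, maxDoc));
        // Per-field denominator: the gap shrinks but does not vanish.
        System.out.printf("docCount: en=%.2f it=%.2f%n",
                bm25Idf(enDocFreq, enDocCount), bm25Idf(itDocFreq, itDocCount));
    }
}
```

With these numbers the per-field docCount narrows the idf gap between the two fields, but the term is still rarer, and therefore scored higher, in the smaller language field.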







--
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io

Re: Skewed IDF in multi lingual index, again

alessandro.benedetti
Furthermore, looking at the code for BM25 similarity, it seems to me it
is currently working correctly:
- docCount is used per field if != -1


  /**
   * Computes a score factor for a simple term and returns an explanation
   * for that score factor.
   *
   * <p>
   * The default implementation uses:
   *
   * <pre class="prettyprint">
   * idf(docFreq, docCount);
   * </pre>
   *
   * Note that {@link CollectionStatistics#docCount()} is used instead of
   * {@link org.apache.lucene.index.IndexReader#numDocs() IndexReader#numDocs()} because also
   * {@link TermStatistics#docFreq()} is used, and when the latter
   * is inaccurate, so is {@link CollectionStatistics#docCount()}, and in the same direction.
   * In addition, {@link CollectionStatistics#docCount()} does not skew when fields are sparse.
   *
   * @param collectionStats collection-level statistics
   * @param termStats term-level statistics for the term
   * @return an Explain object that includes both an idf score factor
   *         and an explanation for the term.
   */
  public Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics termStats) {
    final long df = termStats.docFreq();
    final long docCount = collectionStats.docCount() == -1 ? collectionStats.maxDoc() : collectionStats.docCount();
    final float idf = idf(df, docCount);
    return Explanation.match(idf, "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
        Explanation.match(df, "docFreq"),
        Explanation.match(docCount, "docCount"));
  }




Re: Skewed IDF in multi lingual index, again

Shawn Heisey-2
In reply to this post by alessandro.benedetti
On 12/4/2017 7:21 AM, alessandro.benedetti wrote:
> the reason docCount was improving things is because it was using a docCount
> relative to a specific field while maxDoc is global all over the index ?

Lucene/Solr doesn't actually delete documents when you delete them, it
just marks them as deleted.  I'm pretty sure that the difference between
docCount and maxDoc is deleted documents.  Maybe I don't understand what
I'm talking about, but that is the best I can come up with.

Not all aspects of the impact on scores from deleted documents can be
eliminated, but there has been some effort to make it as minimal as
possible.  For what has been described here, the actual count is
available, so it gets used.

Thanks,
Shawn

Re: Skewed IDF in multi lingual index, again

Yonik Seeley
On Mon, Dec 4, 2017 at 1:35 PM, Shawn Heisey <[hidden email]> wrote:
> I'm pretty sure that the difference between docCount and maxDoc is deleted documents.

docCount (not the best name) here is the number of documents with the
field being searched.  docFreq (df) is the number of documents
actually containing the term in that field.
In the past, maxDoc was used instead of docCount.

-Yonik

Re: Skewed IDF in multi lingual index, again

alessandro.benedetti
"Lucene/Solr doesn't actually delete documents when you delete them, it
just marks them as deleted.  I'm pretty sure that the difference between
docCount and maxDoc is deleted documents.  Maybe I don't understand what
I'm talking about, but that is the best I can come up with. "

Thanks Shawn, yes, that is correct and I was aware of it.
I was curious about another difference:
I think we confirmed that docCount is local to the field (thanks Yonik for
that), so:

docCount(index, field1) = # of documents in the index that currently have
value(s) for field1

My question is whether maxDoc is:

maxDoc(index, field1) = max # of documents in the index that had value(s) for
field1

OR

maxDoc(index) = max # of documents that appeared in the index (field
independent)

Regards





Re: Skewed IDF in multi lingual index, again

Yonik Seeley
On Tue, Dec 5, 2017 at 5:15 AM, alessandro.benedetti
<[hidden email]> wrote:

> My question is whether maxDoc is:
>
> maxDoc(index, field1) = max # of documents in the index that had value(s) for
> field1
>
> OR
>
> maxDoc(index) = max # of documents that appeared in the index (field
> independent)

The latter.
I imagine that's why docCount was introduced (to avoid changing the
meaning of an existing term).
FWIW, the scoring change was made in
https://issues.apache.org/jira/browse/LUCENE-6711 for Lucene/Solr 6.0

-Yonik

Re: Skewed IDF in multi lingual index, again

Doug Turnbull
Just a piece of feedback from clients on the original docCount change.

I have seen several cases with clients where the switch to docCount
surprised them and harmed relevance.

More broadly, I’m concerned that when we make these changes there is no
testing process against test corpora with judgments and relevance metrics
to understand their impact. I see it mentioned in a JIRA from time to time
that someone saw an NDCG improvement on a private collection, and we
have to take their word for it.

Public testing of relevance against every build using stock settings could
be extremely valuable and would more easily justify these changes,
something similar to the performance tests that are already run.

Sadly I can only complain now :) I wish I had time to work on something
like this.

Doug

On Tue, Dec 5, 2017 at 7:38 AM Yonik Seeley <[hidden email]> wrote:

> FWIW, the scoring change was made in
> https://issues.apache.org/jira/browse/LUCENE-6711 for Lucene/Solr 6.0
--
Consultant, OpenSource Connections. Contact info at
http://o19s.com/about-us/doug-turnbull/; Free/Busy (http://bit.ly/dougs_cal)

Re: Skewed IDF in multi lingual index, again

alessandro.benedetti
Thanks Yonik and thanks Doug.

I agree with Doug about adding a few generic test corpora on which Jenkins
automatically runs some metrics, to check that Apache Lucene/Solr changes
don't affect a ground truth too much.
This of course can be very complex, but I think it is a direction the Apache
Lucene/Solr community should work on.

That said, I do believe that in this case, moving from maxDoc (field
independent) to docCount (field dependent) was a good move (and this
specific multi-language use case is an example).

Actually I also believe that, theoretically, docCount (field dependent) is
still better than a field-dependent maxDoc would be.
This is because docCount (field dependent) represents the current state of
the index, while maxDoc reflects its history.
A corpus of documents can change over time, and how rare a term is can
change drastically (take a highly dynamic domain such as news).

Doug, were you able to generalise anything from what happened to your
customers, and why they got regressions moving from maxDoc
to docCount (field dependent)?





Re: Skewed IDF in multi lingual index, again

Doug Turnbull
It is challenging, as relevance performance will be very dependent on the
use case and domain (there's no one globally perfect relevance solution).
But a good set of metrics to see *generally* how stock Solr performs across
a reasonable set of verticals would be nice.

My philosophy about Lucene-based search is that it's not a solution, but
rather a framework that should have sane defaults but large amounts of
configurability.

For example, I'm not sure there's a globally "right" answer to maxDoc vs
docCount.

Problems with docCount come into play when a corpus has a field that is
usually empty but occasionally filled out. This creates a strong bias against
matches in that usually empty field, when previously a match in that field
was weighted very highly.

For example, if a product catalog has a user-editable tag field that is
rarely used, and a product description, such as

Product Name: Nice Pants!
Product Description: Come wear these pants!
Tags: [blue] [acid-wash]

Product Name: Acid Wash Pants
Product Description: Come wear these pants!
Tags: (empty)

In this case, the IDF for the acid wash match in tags is very low using
docCount, whereas with maxDoc it was very high. Not sure what the right
answer is, but there is often a desire for more complete docs to be
boosted much higher, which the maxDoc method does.
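
A back-of-the-envelope sketch of that effect, with invented numbers: a catalog of 10000 products, only 20 of which have any tags, 2 of them tagged acid-wash, and 50 products whose description mentions the term. It uses the BM25 idf formula log(1 + (n - docFreq + 0.5) / (docFreq + 0.5)) and looks only at idf, ignoring term frequency and length normalization.

```java
/** Illustrative only: catalog sizes and frequencies are invented. */
final class SparseFieldIdfDemo {

    /** BM25 idf: log(1 + (n - docFreq + 0.5) / (docFreq + 0.5)). */
    static double bm25Idf(long docFreq, long n) {
        return Math.log(1 + (n - docFreq + 0.5) / (docFreq + 0.5));
    }

    public static void main(String[] args) {
        long maxDoc = 10_000;    // products in the catalog
        long tagsDocCount = 20;  // products with a non-empty tags field
        long tagsDocFreq = 2;    // products tagged "acid-wash"
        long descDocFreq = 50;   // descriptions mentioning the term

        // The description field is always filled, so docCount == maxDoc there.
        double desc = bm25Idf(descDocFreq, maxDoc);
        double tagsByMaxDoc = bm25Idf(tagsDocFreq, maxDoc);
        double tagsByDocCount = bm25Idf(tagsDocFreq, tagsDocCount);

        System.out.printf("description idf:      %.2f%n", desc);
        System.out.printf("tags idf (maxDoc):    %.2f%n", tagsByMaxDoc);
        System.out.printf("tags idf (docCount):  %.2f%n", tagsByDocCount);
    }
}
```

Under these assumptions the tag match outweighs the description match with maxDoc and falls below it with docCount, which is the kind of ordering flip described above.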

Another case where docCount can be a problem is copy fields: with copy
fields, you care that the original field had terms, even if for some reason
they were removed in the analysis chain. This can happen with some methods
we use for simple entity extraction.

Further, the definitions of BM25, etc. rely on corpus-level document
frequency for a term and don't have a concept of fields. BM25F can mostly
be implemented with BlendedTermQuery, which blends doc frequencies across
fields:
http://opensourceconnections.com/blog/2016/10/19/bm25f-in-lucene/


On Tue, Dec 5, 2017 at 10:28 AM alessandro.benedetti <[hidden email]>
wrote:

> Doug, were you able to generalise and abstract any consideration from what
> happened to your customers and why they got regressions moving from maxDocs
> to docCount(field dependent) ?