Jensen–Shannon divergence

Shay Hummel
Hi

I need help implementing a similarity between a query model and a document model.
I would like to use the JS-Divergence
<https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence> for
ranking documents. The documents and the query will be represented
according to the language modeling approach, specifically LMDirichlet.
The similarity will be calculated using the JS-Div between the document
model and the query model.
Is this possible?
If so, how?

Thank you,
Shay Hummel

Re: Jensen–Shannon divergence

wmartinusa
Expand your due diligence beyond Wikipedia,
e.g.:

http://ciir.cs.umass.edu/pubfiles/ir-464.pdf



> On Dec 13, 2015, at 8:30 AM, Shay Hummel <[hidden email]> wrote:
>
> LMDiricletbut its feasibilit

Re: Jensen–Shannon divergence

Shay Hummel
Hi

I am sorry but I didn't understand your answer. Can you please elaborate?

Shay


Re: Jensen–Shannon divergence

wmartinusa
Sorry, it was early.

If you go looking on the web, you can find, as I did, reputable work on implementing Dirichlet language models. However, at this hour you might get answers here. Extrapolating others' work into a Lucene implementation is only slightly different from getting answers here. imo

g'luck



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Jensen–Shannon divergence

Ahmet Arslan
Hi Shay,

I suggest you extend org.apache.lucene.search.similarities.SimilarityBase.
All you need to do is implement a score() method. Fancy names (language models, etc.) aside, a similarity is a function of seven salient statistics. It is actually six: avgFieldLength can be derived from two of the others (numberOfFieldTokens divided by numberOfDocuments).

The seven statistics are:
Corpus statistics: numberOfDocuments, numberOfFieldTokens, avgFieldLength
Term statistics: totalTermFreq and docFreq
Document being scored: within-document term frequency (freq) and document length (docLen)

If you can express your ranking method in terms of these seven variables, you are ready to go. For example, my Dirichlet LM implementation is nothing but:

return log2(1 + (tf / (c * (termFrequency / numberOfTokens)))) + log2(c / (docLength + c));
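
A minimal plain-Java sketch of the one-liner above, for readers who want to try it outside Lucene. This is not the implementation referred to above and not Lucene's LMDirichletSimilarity source. The class name DirichletLmSketch, the method signature and the example values are assumptions made for illustration. It also assumes that c is the Dirichlet smoothing parameter, that termFrequency corresponds to totalTermFreq and that numberOfTokens corresponds to numberOfFieldTokens.

```java
// Hypothetical standalone sketch; names and signature are illustrative only.
final class DirichletLmSketch {

    // Base-2 logarithm, as used in the formula above.
    private static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    /**
     * @param freq                within-document term frequency
     * @param docLen              document length in tokens
     * @param totalTermFreq       occurrences of the term in the whole field
     * @param numberOfFieldTokens total number of tokens in the field
     * @param mu                  smoothing parameter ("c" in the formula above)
     */
    static double score(double freq, double docLen,
                        long totalTermFreq, long numberOfFieldTokens, double mu) {
        // Collection probability of the term; the cast avoids integer division.
        double collectionProb = (double) totalTermFreq / numberOfFieldTokens;
        return log2(1.0 + freq / (mu * collectionProb))
             + log2(mu / (docLen + mu));
    }

    public static void main(String[] args) {
        // Example values are made up.
        System.out.println(score(3, 100, 50, 1_000_000, 2000));
    }
}
```

Note that the second term is negative for any non-empty document, so the sum can drop below zero for rare matches in long documents; how to handle that is a design choice left open here.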

If you need additional statistics, for example the number of unique terms in a document, you need to calculate them yourself and store them in the index (possibly using DocValues). During scoring, you can retrieve them.

Personally, I am curious about your similarity. If possible, please let the community know how effective it is.

Please also see Robert's write-up:
http://lucidworks.com/blog/2011/09/12/flexible-ranking-in-lucene-4/

Thanks,
Ahmet



Re: Jensen–Shannon divergence

Jack Krupansky-3
In reply to this post by Shay Hummel
Is there any particular reason that you find Lucene's built-in TF/IDF and
BM25 similarity models insufficient for your needs? In any case,
examining their source code should get you started if you wish to write
your own:

https://lucene.apache.org/core/5_3_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
https://lucene.apache.org/core/5_3_0/core/org/apache/lucene/search/similarities/BM25Similarity.html

-- Jack Krupansky


RE: Jensen–Shannon divergence

Uwe Schindler
Hi,

Besides BM25 and TF-IDF, Lucene also provides many more similarity implementations:

https://lucene.apache.org/core/5_4_0/core/org/apache/lucene/search/similarities/LMDirichletSimilarity.html
https://lucene.apache.org/core/5_4_0/core/org/apache/lucene/search/similarities/LMJelinekMercerSimilarity.html
https://lucene.apache.org/core/5_4_0/core/org/apache/lucene/search/similarities/IBSimilarity.html
https://lucene.apache.org/core/5_4_0/core/org/apache/lucene/search/similarities/DFRSimilarity.html

If you want to implement your own, choose the closest one and implement the formula as you described. I'd start with SimilarityBase, which is an ideal base class for models such as Dirichlet / DFR / ..., because it has a default implementation for things like phrases.
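
For reference, the formula in question (Jensen–Shannon divergence between two discrete distributions) can be sketched in plain Java, independent of Lucene. This is only an illustration under stated assumptions: the class name JsDivergenceSketch is hypothetical, and the query model and document model are assumed to be already available as probability arrays over the same vocabulary. How to obtain those arrays from an index, or how to fold the result into a score() method, is not covered by this sketch.

```java
// Hypothetical standalone sketch; not part of Lucene.
final class JsDivergenceSketch {

    private static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    // KL(p || m) in bits; terms with p[i] == 0 contribute nothing.
    private static double kl(double[] p, double[] m) {
        double sum = 0.0;
        for (int i = 0; i < p.length; i++) {
            if (p[i] > 0.0) {
                sum += p[i] * log2(p[i] / m[i]);
            }
        }
        return sum;
    }

    // JSD(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m = (p + q) / 2.
    static double jsDivergence(double[] p, double[] q) {
        if (p.length != q.length) {
            throw new IllegalArgumentException("distributions must share a vocabulary");
        }
        double[] m = new double[p.length];
        for (int i = 0; i < p.length; i++) {
            m[i] = 0.5 * (p[i] + q[i]);
        }
        return 0.5 * kl(p, m) + 0.5 * kl(q, m);
    }

    public static void main(String[] args) {
        double[] queryModel = {0.7, 0.2, 0.1}; // made-up example values
        double[] docModel = {0.5, 0.3, 0.2};
        System.out.println(jsDivergence(queryModel, docModel));
    }
}
```

A lower divergence means the two models are more alike, so a ranking function built on this would order documents by increasing divergence (for example, by scoring with its negative).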

> LMDiricletbut its feasibilit

I am not sure what you want to say with this mistyped sentence fragment.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [hidden email]


Re: Jensen–Shannon divergence

wmartinusa
cool list. Thanks Uwe.

Opportunities to gain competitive advantage in selected domains.
