Vector Space Model: New Similarity Implementation Issues

Dharmalingam
Hi List,

I am pretty new to Lucene. Certainly, it is very exciting. I need to implement a new Similarity class based on the Term Vector Space Model given in http://www.miislita.com/term-vector/term-vector-3.html

Although that model is similar to Lucene’s model (http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/search/Similarity.html), I am having a hard time extending the Similarity class to calculate that model.

In that model, “tf” is multiplied with Idf for all terms in the index, but in Lucene “tf” is calculated only for terms in the given Query. Because of that effect, the norm calculation should also include “idf” for all terms. Lucene calculates the norm, during indexing, by “just” counting the number of terms per document. In the web formula (in miislita.com), a document norm is calculated after multiplying “tf” and “idf”.
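To make the difference concrete, here is a minimal sketch in plain Java (no Lucene classes) of the weighting described above: every term of a document gets w = tf * idf, and the document norm is the Euclidean length of those weights. The idf form log10(N/df), the class name TfIdfCosine, and the toy corpus are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TfIdfCosine {

    /** Raw term frequencies of a lower-cased, whitespace-tokenized text. */
    static Map<String, Integer> termFreqs(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    /** Assumed idf form: log10(numDocs / docFreq), computed for every term in the corpus. */
    static Map<String, Double> idf(List<Map<String, Integer>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (Map<String, Integer> doc : docs) {
            for (String term : doc.keySet()) {
                df.merge(term, 1, Integer::sum);
            }
        }
        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            result.put(e.getKey(), Math.log10((double) docs.size() / e.getValue()));
        }
        return result;
    }

    /** w = tf * idf for every term of the text; terms absent from the corpus get weight 0. */
    static Map<String, Double> weights(Map<String, Integer> tf, Map<String, Double> idfByTerm) {
        Map<String, Double> w = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            w.put(e.getKey(), e.getValue() * idfByTerm.getOrDefault(e.getKey(), 0.0));
        }
        return w;
    }

    /** Norm computed after multiplying tf and idf: sqrt of the sum of squared weights. */
    static double norm(Map<String, Double> w) {
        double sum = 0.0;
        for (double v : w.values()) {
            sum += v * v;
        }
        return Math.sqrt(sum);
    }

    /** Cosine similarity: dot product divided by the product of both norms. */
    static double cosine(Map<String, Double> q, Map<String, Double> d) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : q.entrySet()) {
            dot += e.getValue() * d.getOrDefault(e.getKey(), 0.0);
        }
        double denom = norm(q) * norm(d);
        return denom == 0.0 ? 0.0 : dot / denom;
    }

    /** Scores the query against each text, in the order the texts were given. */
    static double[] score(String query, String... texts) {
        List<Map<String, Integer>> docs = new ArrayList<>();
        for (String t : texts) {
            docs.add(termFreqs(t));
        }
        Map<String, Double> idfByTerm = idf(docs);
        Map<String, Double> q = weights(termFreqs(query), idfByTerm);
        double[] scores = new double[texts.length];
        for (int i = 0; i < texts.length; i++) {
            scores[i] = cosine(q, weights(docs.get(i), idfByTerm));
        }
        return scores;
    }

    public static void main(String[] args) {
        double[] s = score("gold silver truck",
                "shipment of gold damaged in a fire",
                "delivery of silver arrived in a silver truck",
                "shipment of gold arrived in a truck");
        for (int i = 0; i < s.length; i++) {
            System.out.println("doc " + (i + 1) + ": " + s[i]);
        }
    }
}
```

With this toy corpus the second document ranks first, because it contains the rarest query term twice. Note that the norm depends on the idf of every term in the document, which is why it cannot be derived from a term count alone.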

FYI: I could implement “idf” according to the miislita.com formula, but not the “tf” and “norm”.

Could you please comment on how I can implement a new Similarity class that will fit into Lucene’s architecture, but still implement the vector space model given in miislita.com?

Thanks a lot for your comments,

Dharma

Re: Vector Space Model: New Similarity Implementation Issues

Grant Ingersoll-2
Not sure I understand what you are asking, but I will give it a
shot. See below.


On Feb 26, 2008, at 3:45 PM, Dharmalingam wrote:

>
> Hi List,
>
> I am pretty new to Lucene. Certainly, it is very exciting. I need to
> implement a new Similarity class based on the Term Vector Space Model
> given in http://www.miislita.com/term-vector/term-vector-3.html
>
> Although that model is similar to Lucene’s model
> (http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/search/Similarity.html),
> I am having a hard time extending the Similarity class to calculate that
> model.
>
> In that model, “tf” is multiplied with Idf for all terms in the index,
> but in Lucene “tf” is calculated only for terms in the given Query.
> Because of that effect, the norm calculation should also include “idf”
> for all terms. Lucene calculates the norm, during indexing, by “just”
> counting the number of terms per document. In the web formula (in
> miislita.com), a document norm is calculated after multiplying “tf” and
> “idf”.

Are you wondering if there is a way to score all documents regardless  
of whether the document has the term or not?  I don't quite get your  
statement: "In that model, “tf” is multiplied with Idf for all terms  
in the index, but in Lucene “tf” is calculated only for terms in the  
given Query."

Isn't the result for those documents that don't have query terms just
going to be 0, or am I not fully understanding? I briefly skimmed the
paper you cite and it doesn't seem that different; it's just
describing Salton's VSM, right?

>
> FYI: I could implement “idf” according to the miislita.com formula, but
> not the “tf” and “norm”.
>
> Could you please comment on how I can implement a new Similarity class
> that will fit into Lucene’s architecture, but still implement the vector
> space model given in miislita.com?

In the end, you may need to implement some lower level Query classes,  
but I still don't fully understand what you are trying to do, so I  
wouldn't head down that path just yet.

--------------------------
Grant Ingersoll
http://www.lucenebootcamp.com
Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ

Re: Vector Space Model: New Similarity Implementation Issues

Dharmalingam
Thanks for the reply. Sorry if my explanation was not clear. Yes, you are correct: the model is based on Salton's VSM. However, the calculation of the term weight and the doc norm is, in my opinion, different from Lucene's. If you look at the table given in http://www.miislita.com/term-vector/term-vector-3.html, they calculate the document norm based on the weight wi = tfi*idfi. I looked at the methods of the Similarity and DefaultSimilarity classes. I have placed them below:

public float lengthNorm(String fieldName, int numTerms) {
    return (float)(1.0 / Math.sqrt(numTerms));
}

You can see that this lengthNorm for a doc is quite different from that website's norm calculation.

Similarly, the queryNorm method of the DefaultSimilarity class is:

 /** Implemented as <code>1/sqrt(sumOfSquaredWeights)</code>. */
  public float queryNorm(float sumOfSquaredWeights) {
    return (float)(1.0 / Math.sqrt(sumOfSquaredWeights));
  }

This again differs from the website's model.

I also have difficulties with the tf method of DefaultSimilarity:
/** Implemented as <code>sqrt(freq)</code>. */
  public float tf(float freq) {
    return (float)Math.sqrt(freq);
  }

In that website's model, tf refers to the frequency of a term within a doc, not its square root.

I hope I have explained it better. Please let me know if it is unclear. I am looking for an easy way to implement that table and, of course, to integrate it with Lucene (i.e., myIndexWriter.setSimilarity(new mySimilarity());). Will this be possible just by inheriting from the base classes of Lucene?
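As a small illustration of why the two norms disagree, the plain-Java sketch below puts the lengthNorm formula quoted above next to a document norm computed from tf*idf weights. The class name NormComparison and all idf values are made up for the example; nothing here calls Lucene.

```java
final class NormComparison {

    /** Same formula as the DefaultSimilarity.lengthNorm snippet: depends only on the term count. */
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    /** Norm in the website's sense: Euclidean length of the weights w_i = tf_i * idf_i. */
    static double tfIdfNorm(int[] tf, double[] idf) {
        if (tf.length != idf.length) {
            throw new IllegalArgumentException("tf and idf must have the same length");
        }
        double sum = 0.0;
        for (int i = 0; i < tf.length; i++) {
            double w = tf[i] * idf[i];
            sum += w * w;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Two hypothetical documents with four distinct terms each, all occurring once.
        int[] tf = {1, 1, 1, 1};
        double[] commonTerms = {0.1, 0.1, 0.1, 0.1}; // made-up idf values for frequent terms
        double[] rareTerms = {0.5, 0.5, 0.5, 0.5};   // made-up idf values for rare terms

        System.out.println("lengthNorm (both docs): " + lengthNorm(4));
        System.out.println("tf*idf norm, common terms: " + tfIdfNorm(tf, commonTerms));
        System.out.println("tf*idf norm, rare terms:   " + tfIdfNorm(tf, rareTerms));
    }
}
```

Both documents have the same length, so lengthNorm cannot tell them apart, while the tf*idf norm changes with the rarity of the terms. That dependence on idf is what makes the website's norm hard to compute inside a callback that only receives numTerms.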

Thanks for your advice.

Re: Vector Space Model: New Similarity Implementation Issues

Grant Ingersoll-2

On Feb 28, 2008, at 9:00 AM, Dharmalingam wrote:

>
> Thanks for the reply. Sorry if my explanation is not clear. Yes, you are
> correct the model is based on Salton's VSM. However, the calculation of
> the term weight and the doc norm is, in my opinion, different from
> Lucene. If you look at the table given in
> http://www.miislita.com/term-vector/term-vector-3.html, they calculate
> the document norm based on the weight wi=tfi*idfi. I looked at the
> interfaces of Similarity and DefaultSimilarity class. I place it below:
>
> public float lengthNorm(String fieldName, int numTerms) {
>    return (float)(1.0 / Math.sqrt(numTerms));
> }
>
> You can see that this lengthNorm for a doc is quite different from that
> website norm calculation.

The lengthNorm method is different from the IDF calculation.  In the  
Similarity class, that is handled by the idf() method.  Length norm is  
an attempt to address one of the limitations listed further down in  
that paper:
"Long Documents: Very long documents make similarity measures  
difficult (vectors with small dot products and high dimensionality)"



>
> Similarly, the queryNorm interface of DefaultSimilarity class is:
>
> /** Implemented as <code>1/sqrt(sumOfSquaredWeights)</code>. */
>  public float queryNorm(float sumOfSquaredWeights) {
>    return (float)(1.0 / Math.sqrt(sumOfSquaredWeights));
>  }
>
> This is again different from the website model.

Query norm is an attempt to allow for comparison of scores across  
queries, but I don't think one should do that anyway.
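A quick way to see why query norm matters little for a single query: it is one factor applied to every document matched by that query, so it rescales scores without reordering them. The sketch below is plain Java with made-up scores; only the queryNorm formula comes from the snippet quoted above, and the class name QueryNormDemo is hypothetical.

```java
import java.util.Arrays;

final class QueryNormDemo {

    /** Same formula as the quoted DefaultSimilarity.queryNorm. */
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    /** Multiplies every score by the same factor. */
    static float[] scale(float[] scores, float factor) {
        float[] scaled = new float[scores.length];
        for (int i = 0; i < scores.length; i++) {
            scaled[i] = scores[i] * factor;
        }
        return scaled;
    }

    /** Document indices sorted by descending score. */
    static Integer[] ranking(final float[] scores) {
        Integer[] order = new Integer[scores.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> Float.compare(scores[b], scores[a]));
        return order;
    }

    public static void main(String[] args) {
        float[] raw = {2.0f, 0.5f, 1.25f}; // made-up unnormalized scores for three docs
        float norm = queryNorm(4.0f);      // made-up sum of squared query weights

        // Both lines print the same document order.
        System.out.println(Arrays.toString(ranking(raw)));
        System.out.println(Arrays.toString(ranking(scale(raw, norm))));
    }
}
```

So for ranking within one query, a custom queryNorm can simply return 1; it only changes the absolute score values.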


>
> I also have difficulties with tf interface of DefaultSimilarity:
> /** Implemented as <code>sqrt(freq)</code>. */
>  public float tf(float freq) {
>    return (float)Math.sqrt(freq);
>  }
>

These are all callback methods from within the Scorer classes that  
each Query uses.  Have a look at TermScorer for how these things get  
called.


Try this as an example:

Set up a really simple index with 1 or 2 docs, each with a few words.
Set up a simple Similarity class where you override all of these
methods to return 1 (or some simple default),
and then index your documents and do a few queries.

Then, have a look at Searcher.explain() to see why a document scores  
the way it does.  Then, you can work to modify from there.
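The idea behind that experiment can be previewed without Lucene. The sketch below is a simplified stand-in, not Lucene's Scorer code: a hypothetical Hooks interface plays the role of the Similarity callbacks, and score() combines them as tf * idf^2 * lengthNorm per matching query term, leaving out coord, queryNorm and boosts.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class CallbackSketch {

    /** Hypothetical stand-in for the Similarity callbacks discussed in this thread. */
    interface Hooks {
        float tf(float freq);
        float idf(int docFreq, int numDocs);
        float lengthNorm(int numTerms);
    }

    /** Every callback returns 1, so each matching query term contributes exactly 1. */
    static final Hooks ALL_ONES = new Hooks() {
        public float tf(float freq) { return 1f; }
        public float idf(int docFreq, int numDocs) { return 1f; }
        public float lengthNorm(int numTerms) { return 1f; }
    };

    /** Simplified score: sum over query terms found in the doc of tf * idf^2 * lengthNorm. */
    static float score(Hooks h, List<String> query, List<String> doc, List<List<String>> index) {
        float norm = h.lengthNorm(doc.size());
        float sum = 0f;
        for (String term : query) {
            int freq = Collections.frequency(doc, term);
            if (freq == 0) {
                continue; // terms missing from the document contribute nothing
            }
            int docFreq = 0;
            for (List<String> other : index) {
                if (other.contains(term)) {
                    docFreq++;
                }
            }
            float idf = h.idf(docFreq, index.size());
            sum += h.tf(freq) * idf * idf * norm;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<List<String>> index = Arrays.asList(
                Arrays.asList("shipment", "gold", "fire"),
                Arrays.asList("gold", "truck", "gold"));
        List<String> query = Arrays.asList("gold", "truck");
        for (List<String> doc : index) {
            System.out.println(doc + " -> " + score(ALL_ONES, query, doc, index));
        }
    }
}
```

With all callbacks fixed at 1 the score is just the number of query terms found in the document. Replacing one hook at a time then shows exactly which factor moves a score, which is the same thing Searcher.explain() reports for a real index.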

Here's the bigger question:  what is your ultimate goal here?  Are you  
just trying to understand Lucene at an academic/programming level or  
do you have something you are trying to achieve in terms of relevance?

-Grant


Re: Vector Space Model: New Similarity Implementation Issues

Dharmalingam
Thanks for your tips. My overall goal is to quickly implement 7 variants of the vector space model using Lucene. You can find these variants in the uploaded file.

I am doing all of this for a much broader goal: I am trying to recover traceability links from requirements to source code files. I treat every requirement as a query. For this problem, I would like to compare this collection of algorithms for their relevance.
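One way such a comparison could be organized outside of Lucene is to keep the ranking loop fixed and pass the term-weighting variant in as a function. The plain-Java sketch below does that for two illustrative tf transforms (raw tf and sqrt(tf)); these are not the seven variants from the uploaded file, and the class name, file names, term counts and the idf form log10(N/df) are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.DoubleUnaryOperator;

final class VariantComparison {

    /** Ranks documents (name -> term frequencies) for a query under the given tf transform. */
    static List<String> rank(Map<String, Map<String, Integer>> docs,
                             List<String> query,
                             DoubleUnaryOperator tfTransform) {
        final Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> doc : docs.entrySet()) {
            double score = 0.0;
            for (String term : query) {
                int tf = doc.getValue().getOrDefault(term, 0);
                if (tf == 0) {
                    continue;
                }
                int df = 0;
                for (Map<String, Integer> other : docs.values()) {
                    if (other.containsKey(term)) {
                        df++;
                    }
                }
                // Assumed idf form: log10(N / df).
                score += tfTransform.applyAsDouble(tf) * Math.log10((double) docs.size() / df);
            }
            scores.put(doc.getKey(), score);
        }
        List<String> names = new ArrayList<>(docs.keySet());
        names.sort((a, b) -> Double.compare(scores.get(b), scores.get(a)));
        return names;
    }

    /** Hypothetical source files, already reduced to term counts. */
    static Map<String, Map<String, Integer>> sampleDocs() {
        Map<String, Integer> cart = new HashMap<>();
        cart.put("cart", 9);
        Map<String, Integer> invoice = new HashMap<>();
        invoice.put("cart", 1);
        invoice.put("tax", 1);
        Map<String, Integer> util = new HashMap<>();
        util.put("log", 2);

        Map<String, Map<String, Integer>> docs = new LinkedHashMap<>();
        docs.put("Cart.java", cart);
        docs.put("Invoice.java", invoice);
        docs.put("Util.java", util);
        return docs;
    }

    public static void main(String[] args) {
        List<String> requirement = Arrays.asList("cart", "tax"); // one requirement used as a query
        System.out.println("raw tf:  " + rank(sampleDocs(), requirement, x -> x));
        System.out.println("sqrt tf: " + rank(sampleDocs(), requirement, Math::sqrt));
    }
}
```

On this toy data the two variants disagree about the top file: raw tf favors the file that repeats "cart" nine times, while sqrt(tf) favors the file that matches both query terms. Differences of this kind are what the comparison across variants would measure.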



ieee-sw-rank.pdf

Re: Vector Space Model: New Similarity Implementation Issues

Grant Ingersoll-2
FYI: The mailing list handler strips attachments.

At any rate, sounds like an interesting project.  I don't know how  
easy it will be for you to implement 7 variants of VSM in Lucene given  
the nature of the APIs, but if you do, it might be handy to see your  
changes as a patch.  :-)  Also not quite sure what all those variants  
will help with when it comes to your broader goal, but that isn't for  
me to decide :-)  Seems like your goal is to find the traceability  
stuff, not see if you can figure out how to change Lucene's  
similarity!  To that end, my two cents would be to focus on creating  
the right kinds of queries, analyzers, etc.


-Grant

On Feb 28, 2008, at 3:55 PM, Dharmalingam wrote:

>
> Thanks for your tips. My overall goal is to quickly implement 7 variants
> of the vector space model using Lucene. You can find these variants in
> the uploaded file.
>
> I am doing all of this for a much broader goal: I am trying to recover
> traceability links from requirements to source code files. I treat every
> requirement as a query. For this problem, I would like to compare this
> collection of algorithms for their relevance.


Re: Vector Space Model: New Similarity Implementation Issues

Dharmalingam
You can find those variants of the vector space model in this interesting article: http://ieeexplore.ieee.org/iel1/52/12658/00582976.pdf?tp=&isnumber=&arnumber=582976

You have now confirmed for me that, given the current nature of the Similarity API, it will not be easy to quickly realize these variants.

Actually, I implemented the earlier website model as a separate Java program, which uses Lucene classes but does not inherit from the Similarity class. It appears that inheriting from the Similarity class will not solve my problem of realizing these variants.


Re: Vector Space Model: New Similarity Implementation Issues

Eric Th
In reply to this post by Dharmalingam
Compared with the classical VSM, Lucene just ignores the denominator (|Q|*|D|) of
the similarity formula,
but it adds norm(t,d) and coord(q,d) to account for the fraction of terms in the
Query and Doc,
so in practice it is a modified implementation of VSM.
Do you just want to verify, using Lucene, which implementation of VSM in "ieee-sw-rank" is
more precise in practice?
If so, it's a useful experiment.
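For reference, the two formulas being compared can be written side by side. The first is the classical cosine measure with tf*idf weights. The second is the practical scoring function as I read it in the Similarity Javadoc linked at the top of the thread, so treat its exact form as an assumption to check against your Lucene version.

```latex
\[
\cos(q,d) = \frac{\sum_{t} w_{t,q}\, w_{t,d}}
                 {\sqrt{\sum_{t} w_{t,q}^{2}}\;\sqrt{\sum_{t} w_{t,d}^{2}}},
\qquad w_{t,x} = \mathit{tf}_{t,x} \cdot \mathit{idf}_{t}
\]
\[
\mathrm{score}(q,d) = \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot
  \sum_{t \in q} \mathit{tf}(t,d) \cdot \mathit{idf}(t)^{2} \cdot
  \mathrm{boost}(t) \cdot \mathrm{norm}(t,d)
\]
```

Read this way, queryNorm(q) and the length factor inside norm(t,d) stand in for the |Q| and |D| terms only approximately, since norm(t,d) is computed at index time from the term count rather than from tf*idf weights.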
