Relative term frequency?

Relative term frequency?

Andy Liu-3
Is there a way to calculate term frequency scores that are relative to
the number of terms in the field of the document?  We want to override
tf() in this way to curb keyword spamming in web pages.  In
Similarity, only the document's term frequency is passed into the tf()
method:

float tf(int freq)

It would be nice to have something like:

float tf(int freq, String fieldName, int numTerms)

If this isn't available out of the box, how difficult would it be to
hack up Lucene to allow for this?
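
To make the idea concrete, here is a rough standalone sketch of what such a relative tf() could compute. It is not part of Lucene and does not hook into Similarity; the class name RelativeTf and the square-root dampening are assumptions made only for illustration.

```java
// Standalone sketch, NOT Lucene API. The class name is hypothetical.
final class RelativeTf {
    private RelativeTf() {}

    /**
     * Term frequency relative to the number of terms in the field.
     * fieldName is unused here; it is kept only to mirror the proposed signature.
     */
    static float tf(int freq, String fieldName, int numTerms) {
        if (freq <= 0 || numTerms <= 0) {
            return 0.0f; // nothing to score, and avoids division by zero
        }
        // Assumption: dampen the ratio with a square root, so repeated
        // terms still have diminishing returns.
        return (float) Math.sqrt((double) freq / numTerms);
    }
}
```

With this, 5 occurrences in a 100-term field and 50 occurrences in a 1000-term field get the same score, so repeating a keyword only helps if it raises the term's share of the field.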

Thanks,
Andy

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Relative term frequency?

Paul Elschot
On Monday 06 June 2005 22:59, Andy Liu wrote:

> Is there a way to calculate term frequency scores that are relative to
> the number of terms in the field of the document?  We want to override
> tf() in this way to curb keyword spamming in web pages.  In
> Similarity, only the document's term frequency is passed into the tf()
> method:
>
> float tf(int freq)
>
> It would be nice to have something like:
>
> float tf(int freq, String fieldName, int numTerms)
>
> If this isn't available out of the box, how difficult would it be to
> hack up Lucene to allow for this?

Have a look here:
http://issues.apache.org/bugzilla/show_bug.cgi?id=31784

It scores terms by density, and it uses a separate table that maps
the norms stored in the index to inverse document lengths.
This table could be adapted as needed.
If that is not enough, it is probably still a good starting point
for what you need.
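
Roughly, the table idea can be sketched as below. This is not the code from the patch, only an illustration with hypothetical names (DensityScorer, density). It assumes that each norm byte decodes to 1/sqrt(field length) with no boosts applied, and that the caller supplies the 256 decoded norm values.

```java
// Illustration only; hypothetical names, not the code attached to the issue.
final class DensityScorer {
    // One entry per possible norm byte value.
    private final float[] inverseLength = new float[256];

    /**
     * decodedNorms[i] is the float that norm byte i decodes to.
     * Assumption: norm = 1 / sqrt(fieldLength), no field or document boosts.
     */
    DensityScorer(float[] decodedNorms) {
        if (decodedNorms.length != 256) {
            throw new IllegalArgumentException("expected 256 decoded norms");
        }
        for (int i = 0; i < 256; i++) {
            float norm = decodedNorms[i];
            // norm = 1/sqrt(len)  implies  1/len = norm * norm
            inverseLength[i] = norm * norm;
        }
    }

    /** Term density: occurrences divided by (approximate) field length. */
    float density(int freq, byte normByte) {
        // Mask to treat the signed Java byte as an index in 0..255.
        return freq * inverseLength[normByte & 0xFF];
    }
}
```

Adapting the scoring then comes down to filling the table differently, for example with a capped or otherwise reshaped inverse length.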

Regards,
Paul Elschot.

