Preventing short documents from being boosted

Preventing short documents from being boosted

Tim.Wright
Hi all,

We have an issue where around 10-20% of our documents are much shorter
(only a paragraph or so of text) than all the rest. Because Lucene
considers document length when indexing, most of the time these shorter
documents end up being scored higher than the longer ones.

We'd prefer it if we could remove the length factor, or at least reduce
the weight of it so that we returned a mixture of long and short
documents. Is there a simple way of doing this? I've considered applying
a document boost based on length, but I'm not quite sure of the equation
I'd have to use to "counter" the innate prioritisation of short
documents.

Cheers,

Tim.
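
The "counter" equation asked about above can be sketched as follows. This assumes the index uses Lucene's classic default length normalisation, which is 1/sqrt(numTerms) for a field, and that a document boost is simply multiplied into that value. The class name, the method names and the `strength` parameter are hypothetical, chosen only for illustration; nothing here calls Lucene itself.

```java
public class LengthNormSketch {

    // Classic default length normalisation: 1 / sqrt(number of terms in the field).
    public static float defaultLengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Hypothetical compensating boost: numTerms^(strength / 2).
    // strength = 1.0 cancels the default norm completely,
    // strength = 0.0 leaves it untouched, values in between only dampen it.
    public static float counterBoost(int numTerms, double strength) {
        return (float) Math.pow(numTerms, strength / 2.0);
    }

    public static void main(String[] args) {
        int[] lengths = {50, 5000};          // a short and a long document
        double[] strengths = {0.0, 0.5, 1.0};
        for (double s : strengths) {
            for (int n : lengths) {
                float effective = defaultLengthNorm(n) * counterBoost(n, s);
                System.out.printf("strength=%.1f terms=%d effective norm=%.4f%n",
                        s, n, effective);
            }
        }
    }
}
```

With `strength` at 1.0 the product is the same for every document length, so length no longer affects the score; smaller values shrink the gap between short and long documents without closing it. In a real index the boost and the length norm are stored together in a single byte per field, so the cancellation would only be approximate.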


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Preventing short documents from being boosted

Grant Ingersoll
http://www.gossamer-threads.com/lists/lucene/java-user/38967#38967

-Grant

On Sep 8, 2006, at 5:57 AM, Wright, Tim wrote:

> Hi all,
>
> We have an issue where around 10-20% of our documents are much shorter
> (only a paragraph or so of text) than all the rest. Because Lucene
> considers document length when indexing, most of the time these shorter
> documents end up being scored higher than the longer ones.
>
> We'd prefer it if we could remove the length factor, or at least reduce
> the weight of it so that we returned a mixture of long and short
> documents. Is there a simple way of doing this? I've considered applying
> a document boost based on length, but I'm not quite sure of the equation
> I'd have to use to "counter" the innate prioritisation of short
> documents.
>
> Cheers,
>
> Tim.

--------------------------
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org


Re: Preventing short documents from being boosted

Daniel Naber-5
On Friday 08 September 2006 13:30, Grant Ingersoll wrote:

> http://www.gossamer-threads.com/lists/lucene/java-user/38967#38967

I'd welcome feedback on that similarity class, i.e. whether anyone has
used it successfully. If so, we could add it to the Lucene core (the old
similarity would stay the default, though).
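
For readers who have not written a custom similarity before, here is a minimal sketch of the kind of length-normalisation functions such a class could use. It is not the class from the linked post, whose contents are not shown in this thread. The class name, both method names and the `minTerms` threshold are hypothetical, and the code is plain Java with no Lucene dependency.

```java
public class FlooredLengthNormSketch {

    // Removes the length factor entirely: every document gets the same norm.
    public static float flatLengthNorm(int numTerms) {
        return 1.0f;
    }

    // Keeps the classic 1 / sqrt(numTerms) shape, but treats any document
    // shorter than minTerms as if it had minTerms terms, so very short
    // documents gain no extra advantage below that threshold.
    public static float flooredLengthNorm(int numTerms, int minTerms) {
        int effectiveTerms = Math.max(Math.max(numTerms, minTerms), 1);
        return (float) (1.0 / Math.sqrt(effectiveTerms));
    }

    public static void main(String[] args) {
        int minTerms = 500; // hypothetical threshold, to be tuned per collection
        int[] lengths = {50, 500, 5000};
        for (int n : lengths) {
            System.out.printf("terms=%d flat=%.4f floored=%.4f%n",
                    n, flatLengthNorm(n), flooredLengthNorm(n, minTerms));
        }
    }
}
```

In Lucene itself, logic like this would normally go into a `Similarity` subclass that overrides the length-normalisation method. Because norms are computed when documents are indexed, the documents would need to be reindexed for a change to take effect.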

Regards
 Daniel

--
http://www.danielnaber.de
