Changing the scoring (newest doc date first)

Changing the scoring (newest doc date first)

Marcus Falck
Hello,
 
I'm working on a very large implementation of a search engine based on the Lucene API (1.4.3). We have also been investigating enterprise search vendors such as FAST and Verity, but have come to the conclusion that we might as well save ourselves 1 million dollars by doing our own implementation on Lucene.
 
What we are talking about here is indexing data from a lot of different systems, all containing a LOT of documents. The index will be partitioned by date range, and each range will be served by one or more machines holding the same index (load balanced using round robin).
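The layout described above could be sketched roughly as follows. This is only an illustration in Python, not the poster's code: the partition boundaries, the host names, and the function names are all made up, and the only ideas taken from the post are "partition by date range" and "round robin over identical replicas".

```python
from bisect import bisect_right
from datetime import date
from itertools import cycle

# Hypothetical layout: each partition starts at the given date and runs
# until the next start date. Host names are invented for illustration.
PARTITIONS = [
    (date(2004, 1, 1), ["idx-2004-a"]),
    (date(2005, 1, 1), ["idx-2005-a", "idx-2005-b"]),
    (date(2006, 1, 1), ["idx-2006-a", "idx-2006-b", "idx-2006-c"]),
]
_STARTS = [start for start, _ in PARTITIONS]
# One endless round-robin iterator per partition.
_ROTATIONS = [cycle(hosts) for _, hosts in PARTITIONS]


def partition_for(doc_date):
    """Return the index of the date-range partition that owns doc_date."""
    i = bisect_right(_STARTS, doc_date) - 1
    if i < 0:
        raise ValueError(f"no partition covers {doc_date}")
    return i


def hosts_for_indexing(doc_date):
    """Every replica of the owning partition must receive the document."""
    return PARTITIONS[partition_for(doc_date)][1]


def host_for_search(partition_index):
    """Pick the next replica of a partition in round-robin order."""
    return next(_ROTATIONS[partition_index])
```

A search for "newest first" would then visit the most recent partition first and only fall back to older partitions if it still needs more hits.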
 
Currently the total size of all documents we need to index is around 2 TB (200 million documents), but this is growing by approximately 200,000 documents per day.
 
I have already written code for a prototype that consists of: a fetcher application, which fetches data from the original systems' storage and distributes the documents using SOAP over TCP to the correct data interval (and that interval's machines); SearchMachineHost (the actual index/search process on each machine); a Search/Index API (which makes the whole clustering part transparent); AlertHost (for time-sensitive alerts); and demo applications. Everything looks very good and we are very satisfied with the performance.
 
---PROBLEM---
There is, however, one LARGE problem that we have run into. All search results should be displayed sorted with the newest document at the top. We tried to accomplish this using Lucene's sort capabilities but quickly ran into large performance bottlenecks. So I figured that since the default sort is by relevance, I would like to change the relevance so that we don't even need to sort the documents. I guess a lot of people on this mailing list can give me valuable hints about how to accomplish this!
(Since I know about the ability to sort by index ID (which I haven't tried), I should add that I can't guarantee that all documents will be added in correct date order. Remember that there are several source systems, and the future plan is to buy content from different providers on the market and index it as well.)
 
Please help me in my fight against FAST and Verity =D
 
/ Regards
Marcus Falck, Stockholm, Sweden.
 
I would also like to thank everyone who has been involved in Lucene development.
Very nice work!
 

RE: Changing the scoring (newest doc date first)

Mordo, Aviran (EXP N-NANNATEK)
When you write your query, you can add date range clauses with a boost
factor on the date field, i.e. boost by a factor x the documents that
have a date of today, boost by x-1 the documents from the past week,
boost by x-2 the documents from the past two weeks, etc.

This will not be a perfect sort on the dates, but it will boost newer
documents according to your date ranges.
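As a rough sketch of this idea, the snippet below builds query-parser style clauses with decreasing boosts. It is only string building in Python: the field name `date`, the yyyyMMdd term format, and the helper names are assumptions, and it presumes the query parser accepts a `^boost` on a range clause. The sketch uses non-overlapping weekly ranges; overlapping ranges would add their boosts together instead.

```python
from datetime import date, timedelta

DATE_FORMAT = "%Y%m%d"  # assumed format of the indexed date terms


def recency_clauses(today, top_boost, weeks, field="date"):
    """Build boosted range clauses: today, then one clause per past week."""
    day = today.strftime(DATE_FORMAT)
    clauses = [f"{field}:[{day} TO {day}]^{top_boost}"]
    end = today - timedelta(days=1)
    for week in range(1, weeks + 1):
        boost = top_boost - week
        if boost <= 0:  # stop before boosts become zero or negative
            break
        start = end - timedelta(days=6)
        clauses.append(
            f"{field}:[{start.strftime(DATE_FORMAT)} TO "
            f"{end.strftime(DATE_FORMAT)}]^{boost}"
        )
        end = start - timedelta(days=1)
    return clauses


def with_recency(user_query, clauses):
    """Require the user's query; the optional date clauses only add score."""
    return f"+({user_query}) " + " ".join(clauses)
```

For example, `with_recency("lucene", recency_clauses(date(2006, 5, 16), 4, 2))` yields a required clause for the user's terms followed by three optional date clauses boosted by 4, 3 and 2.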

HTH

Aviran
http://www.aviransplace.com

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Changing the scoring (newest doc date first)

Yonik Seeley
In reply to this post by Marcus Falck
On 5/16/06, Marcus Falck <[hidden email]> wrote:
> I'm working on a very large implementation of a search engine based on the Lucene API (1.4.3). We have also been investigating enterprise search vendors such as FAST and Verity, but have come to the conclusion that we might as well save ourselves 1 million dollars by doing our own implementation on Lucene.

That's the same conclusion we came to... and how Solr came about.
If it is close enough to meeting your needs, it might make sense to collaborate.

> So I figured that since the default sort is by relevance, I would like to change the relevance so that we don't even need to sort the documents.

Documents sorted by relevance are still sorted.
How much slower is a sort on another field vs. a sort on relevance (not
counting the first time)?


-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server


RE: Changing the scoring (newest doc date first)

Marcus Falck
In reply to this post by Marcus Falck
Yes, the default behavior (sort on relevance) is a form of sort. But that sort doesn't need to access the field values, which makes it a lot faster.
 
Sorting on fields works well up to index sizes of a couple of gigabytes (on a test machine: dual Opteron, 2 GB RAM).
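A back-of-envelope estimate may help explain why field sorting degrades at this scale. It assumes the sort keeps one value per document in memory and that the date is held as a 4-byte integer; both are assumptions for illustration, not something stated in the thread, and string sort keys would cost more.

```python
def sort_cache_bytes(num_docs, bytes_per_value=4):
    """Rough size of an array holding one sort value per document."""
    return num_docs * bytes_per_value


total_docs = 200_000_000   # corpus size from the original post
daily_growth = 200_000     # documents added per day, from the original post

whole_corpus = sort_cache_bytes(total_docs)
one_year_of_growth = sort_cache_bytes(daily_growth * 365)

print(f"whole corpus:       {whole_corpus / 1e6:.0f} MB")
print(f"one year of growth: {one_year_of_growth / 1e6:.0f} MB")
```

Under these assumptions a single sort field over the whole corpus needs about 800 MB, a large share of the 2 GB test machine, although partitioning by date range spreads that cost across machines.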
 
/
Marcus


Re: Changing the scoring (newest doc date first)

Doug Cutting
In reply to this post by Marcus Falck
Marcus Falck wrote:
> There is, however, one LARGE problem that we have run into. All search results should be displayed sorted with the newest document at the top. We tried to accomplish this using Lucene's sort capabilities but quickly ran into large performance bottlenecks. So I figured that since the default sort is by relevance, I would like to change the relevance so that we don't even need to sort the documents. I guess a lot of people on this mailing list can give me valuable hints about how to accomplish this!
> (Since I know about the ability to sort by index ID (which I haven't tried), I should add that I can't guarantee that all documents will be added in correct date order. Remember that there are several source systems, and the future plan is to buy content from different providers on the market and index it as well.)

A HitCollector should help.  Matching documents are passed to a
HitCollector in the order they were added to the index.  So if newer
documents were added to your index later, then the newest N documents
are simply the last N documents passed to the HitCollector.

Could that work?
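The idea can be sketched without any Lucene dependency. The class below is a Python stand-in, not the Lucene HitCollector class itself; it only mirrors the shape of a `collect(doc, score)` callback and assumes, as described above, that matches arrive in the order the documents were added to the index.

```python
from collections import deque


class LastNCollector:
    """Keeps the IDs of the last n matching documents it is shown."""

    def __init__(self, n):
        self._ids = deque(maxlen=n)  # older IDs fall off the left end
        self.total_hits = 0

    def collect(self, doc_id, score):
        # Called once per matching document, in index (insertion) order.
        # The score is ignored because only recency matters here.
        self.total_hits += 1
        self._ids.append(doc_id)

    def newest_first(self):
        """Return the retained doc IDs, most recently added first."""
        return list(reversed(self._ids))


collector = LastNCollector(3)
for doc_id in [2, 5, 9, 14, 20]:  # pretend these are the matching doc IDs
    collector.collect(doc_id, 1.0)
print(collector.newest_first(), collector.total_hits)
```

Memory use is bounded by n no matter how many documents match, which is the appeal of the approach. It gives date order only when insertion order and date order agree.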

Cheers,

Doug


RE: Changing the scoring (newest doc date first)

Marcus Falck
In reply to this post by Marcus Falck
Hmm.
Not sure that I understand exactly what you mean.
Doesn't your solution require me to add all documents in correct date order?
Since I will index articles from different systems, I can't guarantee that all articles will be added to the index in correct date order.
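When insertion order cannot be trusted, one possible variation is to keep a small heap keyed on each document's date while collecting. This is a hedged sketch in Python, not anything proposed in the thread: `date_of` is a hypothetical in-memory lookup from document ID to a yyyymmdd integer, so it still needs one date value per document in memory.

```python
import heapq


def newest_n(matches, date_of, n):
    """Return the n matching doc IDs with the latest dates, newest first.

    matches yields doc IDs in any order; date_of maps a doc ID to a
    comparable date key (here a yyyymmdd integer, an assumed encoding).
    """
    if n <= 0:
        return []
    heap = []  # min-heap of (date_key, doc_id); the root is the oldest kept
    for doc_id in matches:
        entry = (date_of[doc_id], doc_id)
        if len(heap) < n:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            heapq.heapreplace(heap, entry)  # evict the oldest kept entry
    return [doc_id for _, doc_id in sorted(heap, reverse=True)]


# Documents indexed out of date order (made-up sample data).
dates = {0: 20060510, 1: 20060516, 2: 20060101, 3: 20060515, 4: 20060512}
print(newest_n(range(5), dates, 3))
```

Each match costs at most one heap operation of size n, so the work grows with the number of matches rather than with a full sort of them.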
 
/
Marcus
