Can one determine which results are "good enough" to alert users about?

Chris Harris-2
I'm trying to think through a Solr-based email alerting engine that
would have the following properties:

1. Users can enter queries they want to be alerted on, and the syntax
for alert queries should be the same as for my regular Solr
(dismax) queries.

1a. Corollary: Because of not just tf-idf but also dismax pf and qf
boosting, this implies that the set of documents that match a given
query will vary widely in quality; the first page of search results
will be quite good, but the last page won't be worth looking at.

2. The email alerting engine shouldn't bother alerting people about
*all* new results for a given query; in particular it should avoid the
poor-quality tail of results and just alert on "the good stuff".

Unfortunately, my current understanding of Solr/Lucene is that there's
not a good automatic way to partition the set of query results into
"good stuff" vs "not good stuff". The main option I know of is to
filter out documents below a certain score threshold, but if you
search the Lucene/Solr mailing lists, people will advise that this is
unlikely to be fruitful. (It ultimately boils down to the fact that
Lucene/Solr scores weren't designed to mean anything as absolute
numbers, only when compared to other scores.)

This makes me wonder if there's something wrong with my original
requirements, or whether people have thought of some other way to
approach this.

Interestingly, Google appears to have solved this at least to some
degree with Google Alerts (http://www.google.com/alerts); there you
can choose to receive "Only the best results" rather than "All the
results". I'm not clear how they determine which results are "best",
but their UI certainly implies they've come up with some scheme for
it.

Thanks,
Chris

Re: Can one determine which results are "good enough" to alert users about?

Jan Høydahl / Cominvent
For such an alerting service, I would make it a requirement that it's WYSIWYG: let the user enter a search and then refine it through facets, filters, ranges etc. until he is satisfied with ALL the results returned. Do not rely on relevance here; sort the results by date or similar instead. You can then offer a preview service in which the user can see how many and which results he WOULD have received by email over the last week or month before he saves the query. That way the user will be more satisfied with the alerts, and your implementation is straightforward.
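
As a rough sketch of the preview idea, the request parameters for "what would this saved query have matched in the last week" might be built as below. The field name `timestamp`, the example filter `category:legal`, and the helper `preview_params` are assumptions made up for illustration; `q`, `defType`, `fq`, `sort` and the `NOW-7DAYS` date math are standard Solr request features.

```python
from urllib.parse import urlencode


def preview_params(user_query, filters, days=7, date_field="timestamp"):
    """Build Solr request parameters for previewing a saved alert.

    date_field is a hypothetical date field holding the time each
    document was added or updated; adjust to the real schema.
    """
    params = [("q", user_query), ("defType", "dismax")]
    # Filters the user picked through facets, ranges etc.
    for f in filters:
        params.append(("fq", f))
    # Restrict to documents from the preview window.
    params.append(("fq", f"{date_field}:[NOW-{days}DAYS TO NOW]"))
    # Sort by date rather than by relevance score.
    params.append(("sort", f"{date_field} desc"))
    return params


params = preview_params("software license", ["category:legal"])
print(urlencode(params))
```

The encoded string would be appended to the select handler URL of the Solr core in question; the number of hits returned is then the count the user would have been alerted about.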

Of course it depends on the domain as well. For classifieds or e-commerce it is easier to set loads of metadata filters to narrow the search. If your use case is for a domain which inherently is full-text based and always returns a lot of results with a long tail, I would design the service in a way such that the alerts themselves contain only top-N hits, with a link in the email to see all (of course only those which are new or updated since last alert).
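
The top-N variant could look roughly like the following sketch. It assumes the hits that are new or updated since the last alert have already been fetched for one saved query as `(doc_id, score)` pairs; `build_alert` and the returned dictionary layout are hypothetical names for illustration only.

```python
def build_alert(new_hits, top_n=10):
    """Pick the top-N hits for one saved query's alert email.

    new_hits: list of (doc_id, score) pairs that are new or updated
    since the last alert. Scores are only compared within this one
    query's result set, never across queries.
    """
    ranked = sorted(new_hits, key=lambda hit: hit[1], reverse=True)
    shown = ranked[:top_n]
    return {
        "shown": [doc_id for doc_id, _ in shown],
        # Number of further hits reachable through the "see all" link.
        "more": len(ranked) - len(shown),
    }


alert = build_alert([("a", 1.0), ("b", 3.0), ("c", 2.0)], top_n=2)
print(alert)
```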

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 9 May 2012, at 10:50, Chris Harris wrote:

> I'm trying to think through a Solr-based email alerting engine that
> would have the following properties:
>
> 1. Users can enter queries they want to be alerted on, and the syntax
> for alert queries should be the same syntax as my regular solr
> (dismax) queries.
>
> 1a. Corollary: Because of not just tf-idf but also dismax pf and qf
> boosting, this implies that the set of documents that match a given
> query will vary widely in quality; the first page of search results
> will be quite good, but the last page won't be worth looking at.
>
> 2. The email alerting engine shouldn't bother alerting people about
> *all* new results for a given query; in particular it should avoid the
> poor-quality tail of results and just alert on "the good stuff".
>
> Unfortunately, my current understanding of Solr/Lucene is that there's
> not a good automatic way to partition the set of query results into
> "good stuff" vs "not good stuff". The main option I know of is to
> filter out documents below a certain score threshold, but if you
> search the Lucene/Solr mailing lists, people will advise that this is
> unlikely to be fruitful. (It ultimately boils down to the fact that
> Lucene/Solr scores weren't designed to mean anything as absolute
> numbers, only when compared to other scores.)
>
> This makes me wonder if there's something wrong with my original
> requirements, or whether people have thought of some other way to
> approach this.
>
> Interestingly, Google appears to have solved this at least to some
> degree with Google Alerts (http://www.google.com/alerts); there you
> can choose to receive "Only the best results" rather than "All the
> results". I'm not clear how they determine which results are "best",
> but their UI certainly implies they've come up with some scheme for
> it.
>
> Thanks,
> Chris


Re: Can one determine which results are "good enough" to alert users about?

Otis Gospodnetic-2
In reply to this post by Chris Harris-2
Hi Chris,

I think there is some confusion here.
When people warn against relying on relevance scores, they are talking about comparing scores across queries.
What you have is a different situation, or at least one that lends itself to working around the problem, at least partially.

You have N users.
Each user enters N queries.

You have an incoming stream of documents that you want to match against all users' saved queries.

When a new document is matched you could:
1) send it to the user right away
2) store it somewhere as a document that matched a query Q and send all matches to users periodically.

If you go with 1) then either you send all matches to users, or you introduce the notion of score thresholds. That's bad for the reason you already identified.
If you go with 2) then you have the option of batching up matches for each saved query and alerting users only every few hours. Then, you could introduce logic that says:
"If there are >N matches for query Q, then remove all matches with score <S"
"If there are >M matches for query Q, then remove all matches with score <R"
"If there are <Z matches for query Q, then keep all matches"
...
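
A minimal sketch of that batching logic, assuming each batch is a list of `(doc_id, score)` pairs for a single saved query. The function name `filter_batch`, the tier layout, and the numeric values in the example are all made up for illustration; suitable counts and score cut-offs would have to be found by experiment.

```python
def filter_batch(matches, tiers, keep_all_below):
    """Trim one saved query's batch of matches before alerting.

    matches: list of (doc_id, score) pairs for a single query Q.
    tiers: list of (min_count, min_score) rules; if the batch holds
           more than min_count matches, scores below min_score go.
    keep_all_below: batches smaller than this are sent untouched.
    """
    count = len(matches)
    if count < keep_all_below:
        return list(matches)
    # Check the rule with the largest count first, so the strictest
    # applicable threshold wins.
    for min_count, min_score in sorted(tiers, reverse=True):
        if count > min_count:
            return [m for m in matches if m[1] >= min_score]
    return list(matches)


# Hypothetical values: N=100 with S=2.0, M=20 with R=1.0, Z=5.
tiers = [(100, 2.0), (20, 1.0)]
batch = [(i, 0.5) for i in range(15)] + [(i, 1.5) for i in range(15, 30)]
print(len(filter_batch(batch, tiers, keep_all_below=5)))
```

Because every score in a batch comes from the same query, the thresholds are never compared across queries.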

Maybe you can turn this into a feature in your product ;)

Otis 
----
Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm 



>________________________________
> From: Chris Harris <[hidden email]>
>To: [hidden email]
>Sent: Wednesday, May 9, 2012 4:50 AM
>Subject: Can one determine which results are "good enough" to alert users about?
>
>I'm trying to think through a Solr-based email alerting engine that
>would have the following properties:
>
>1. Users can enter queries they want to be alerted on, and the syntax
>for alert queries should be the same syntax as my regular solr
>(dismax) queries.
>
>1a. Corollary: Because of not just tf-idf but also dismax pf and qf
>boosting, this implies that the set of documents that match a given
>query will vary widely in quality; the first page of search results
>will be quite good, but the last page won't be worth looking at.
>
>2. The email alerting engine shouldn't bother alerting people about
>*all* new results for a given query; in particular it should avoid the
>poor-quality tail of results and just alert on "the good stuff".
>
>Unfortunately, my current understanding of Solr/Lucene is that there's
>not a good automatic way to partition the set of query results into
>"good stuff" vs "not good stuff". The main option I know of is to
>filter out documents below a certain score threshold, but if you
>search the Lucene/Solr mailing lists, people will advise that this is
>unlikely to be fruitful. (It ultimately boils down to the fact that
>Lucene/Solr scores weren't designed to mean anything as absolute
>numbers, only when compared to other scores.)
>
>This makes me wonder if there's something wrong with my original
>requirements, or whether people have thought of some other way to
>approach this.
>
>Interestingly, Google appears to have solved this at least to some
>degree with Google Alerts (http://www.google.com/alerts); there you
>can choose to receive "Only the best results" rather than "All the
>results". I'm not clear how they determine which results are "best",
>but their UI certainly implies they've come up with some scheme for
>it.
>
>Thanks,
>Chris
>
>
>

Re: Can one determine which results are "good enough" to alert users about?

Jan Høydahl-3
Hi,

The whole idea of a score threshold is flawed in this situation.
Chris, you say yourself that you plan to let people subscribe to searches that are known to have crappy results for perhaps the majority of hits, and there is no automatic way of rectifying that.

Imagine a search for the two words Software License, where your search does an OR search with stemming etc.
In a large corpus, scoring will see to it that the first page is probably filled with hits relevant to both words. But if you match against smaller batches of documents, say all new docs every hour or day, you may very well end up in a situation where no docs are relevant, yet you still find plenty of matches for only Software or only License/licenses/licensing. This would be slightly better with an AND search, but it would not be usable for alerting unless the query itself was a phrase query for "Software License".
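
To make the point concrete, here is a toy illustration in plain Python. It is not Lucene's analysis or scoring: matching is on lowercased whitespace tokens with no stemming, and the four documents in the batch are invented.

```python
def tokens(text):
    """Very crude tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def matches_or(doc, terms):
    return any(t in tokens(doc) for t in terms)


def matches_and(doc, terms):
    return all(t in tokens(doc) for t in terms)


def matches_phrase(doc, terms):
    """True if the terms occur adjacent and in order."""
    toks = tokens(doc)
    n = len(terms)
    return any(toks[i:i + n] == terms for i in range(len(toks) - n + 1))


# A small batch of "new documents"; only the last one is about
# software licenses.
batch = [
    "new software release announced",
    "fishing license fees increase",
    "software team renews parking license",
    "updated software license terms published",
]
terms = ["software", "license"]

for name, fn in [("OR", matches_or), ("AND", matches_and), ("phrase", matches_phrase)]:
    hits = [doc for doc in batch if fn(doc, terms)]
    print(name, len(hits))
```

On this batch the OR query matches every document, the AND query still lets through one irrelevant document, and only the phrase query isolates the relevant one.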

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 9 May 2012, at 22:55, Otis Gospodnetic wrote:

> Hi Chris,
>
> I think there is some confusion here.
> When people say things about relevance scores they talk about comparing them across queries.
> What you have is a different situation, or at least a situation that lends itself to working around this, at least partially.
>
> You have N users.
> Each user enters N queries.
>
> You have an incoming stream of documents that you want to match against all users' saved queries.
>
> When a new document is matched you could:
> 1) send it to user right away
> 2) store it somewhere as a document that matched a query Q and send all matches to users periodically.
>
> If you go with 1) then either you send all matches to users, or you introduce the notion of the score thresholds.  That's bad for the reason you already identified.
> If you go with 2) then you have the option of batching up matches for each saved query and alerting users only every N hours.  Then, you could introduce logic that says:
> "If there are >N matches for query Q then remove all matches with score <S"
> "If there are >M matches for query Q, then remove all matches with score <R"
> "If there are <Z matches for query Q, then keep all matches"
> ...
>
> Maybe you can turn this into a feature in your product ;)
>
> Otis
> ----
> Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm 