Facets with an IDF concept

Asif Rahman
Hi all,

We have an index of news articles that are tagged with news topics.
Currently, we use Solr facets to see which topics are popular for a given
query or time period.  I'd like to apply the concept of IDF to the facet
counts so as to penalize the topics that occur broadly throughout our index.
I've begun to write a custom facet component that applies the IDF to the facet
counts, but I also wanted to check whether anyone has experience using facets in
this way.

Thanks,

Asif

Re: Facets with an IDF concept

Asif Rahman
Hi again,

I guess nobody has used facets in the way I described below before.  Do any
of the experts have any ideas as to how to do this efficiently and
correctly?  Any thoughts would be greatly appreciated.

Thanks,

Asif

--
Asif Rahman
Lead Engineer - NewsCred
[hidden email]
http://platform.newscred.com

Re: Facets with an IDF concept

Kent Fitch
Hi Asif,

I was holding back because we have a similar problem, but we're not
sure how best to approach it, or even whether approaching it at all is
the right thing to do.

Background:
- large index (~35m documents)
- about 120k of these include full text book contents plus metadata,
the rest are just metadata
- we plan to increase the number of full text books to around 1m, and the
number of records will greatly increase

We've found that because of the sheer volume of content in full text,
we get lots of results in full text of very low relevance. The Lucene
relevance ranking works wonderfully to "hide" these way down the list,
and when these are the only results at all, the user may be delighted
to find obscure hits.

But when you search for, say : soldier of fortune : one of the 55k+
results is Huck Finn, with 4 "soldier(s)" and 6 "fortunes", but it
probably isn't relevant.  The searcher will find it in the result
sets, but should the author, subject, dates, formats etc (our facets)
of Huck Finn be contributing to the facets shown to the user as
equally as, say, the top 500 results?  Maybe, but perhaps they are
"diluting" the value of facets contributed by the more relevant
results.

So, we are considering restricting the contents of the result bit set
used for faceting to exclude results with a very very low score (with
our own QueryComponent).  But there are problems:

- what's a low score?  How will a low score threshold vary across
queries? (Or should we use a rank cutoff instead, which is much more
expensive to compute, or some combo that works with results that only
have very low relevance results?)

- should we do this for all facets, or just some (where the less
relevant results seem particularly annoying, as they can "mask" facets
from the most relevant results - the authors, years and subjects we
have full text for are not representative of the whole corpus)

- if a searcher pages through to the 1000th result page, down to these
less relevant results, should we somehow include these results in the
facets we show?

Sorry, only more questions!

Regards,

Kent Fitch


Re: Facets with an IDF concept

Grant Ingersoll-2
In reply to this post by Asif Rahman

I'm not sure I'm following.  Would you be faceting on one field, but  
using the DF from some other field?  Faceting is already a count of  
all the documents that contain the term on a given field for that  
search.  If I'm understanding, you would still do the typical  
faceting, but then rerank by the global DF values, right?

Backing up, what is the problem you are seeing that you are trying to  
solve?

I think you could do this, but you'd have to hook it in yourself.  By  
penalize, do you mean remove, or just have them in the sort?  
Generally speaking, looking up the DF value can be expensive,  
especially if you do a lot of skipping around.  I don't know how  
pluggable the sort capabilities are for faceting, but that might be  
the place to start if you are just looking at the sorting options.



--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Facets with an IDF concept

Asif Rahman
In reply to this post by Kent Fitch
Hi Kent,

Your problem is a close cousin of the problem that we're tackling.  We have
experienced the same problem as you when calculating facets on MoreLikeThis
queries, since those queries tend to match a lot of documents.  We used one
of the solutions that you mentioned, rank cutoff, to solve it.  We first run
the MoreLikeThis query, then use the top N documents' unique ids as a filter
query for a second query.  The performance is still acceptable; however, our
index is smaller than yours by an order of magnitude.
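
A rough sketch of the second step of that rank cutoff, building the filter
query from the top N ids. The field names `id` and `topic` and the helper
names are assumptions for illustration, not details from our setup; the
request parameters (q, fq, rows, facet, facet.field) are standard Solr ones.

```python
def build_id_filter(doc_ids, id_field="id"):
    """Build a Solr fq clause that restricts matches to the given ids.

    `id_field` is a hypothetical unique-key field name.
    """
    if not doc_ids:
        raise ValueError("need at least one document id")
    # Quote each id and escape backslashes and embedded double quotes.
    quoted = [
        '"%s"' % str(d).replace("\\", "\\\\").replace('"', '\\"')
        for d in doc_ids
    ]
    return "%s:(%s)" % (id_field, " OR ".join(quoted))


def facet_params(doc_ids, facet_field, id_field="id"):
    """Parameters for the second query: facet only over the top N ids."""
    return {
        "q": "*:*",
        "fq": build_id_filter(doc_ids, id_field),
        "rows": 0,            # only the facet counts are needed
        "facet": "true",
        "facet.field": facet_field,
    }


# Ids as they might come back from the first (MoreLikeThis) query.
params = facet_params(["a1", "b2"], "topic")
```

With a large N the fq string gets long, so this is probably only practical
when N stays in the hundreds.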

Regards,

Asif


Re: Facets with an IDF concept

Asif Rahman
In reply to this post by Grant Ingersoll-2
Hi Grant,

I'll give a real life example of the problem that we are trying to solve.

We index a large number of current news articles on a continuing basis.  We
tag these articles with news topics (e.g. Barack Obama, Iran, etc.).  We
then use these tags to facet our queries.  For example, we might issue a
query for all articles in the last 24 hours.  The facets would then tell us
which news topics have been written about the most in that period.  The
problem is that "Barack Obama", for example, is always written about in high
frequency, as opposed to "Iran" which is currently very hot in the news, but
which has not always been the case.  In this case, we'd like to see "Iran"
show up higher than "Barack Obama" in the facet results.

To me, this seems identical to the tf-idf scoring expression that is used in
normal search.  The facet count is analogous to the tf and I can access the
facet term idf's through the Similarity API.
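
As a minimal sketch of that analogy (not the custom component itself), the
weighting could look like the following. The idf formula is the classic
Lucene one, 1 + ln(numDocs / (docFreq + 1)); the document frequencies and
index size below are invented purely for illustration.

```python
import math


def idf(doc_freq, num_docs):
    """Classic Lucene-style idf: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1.0))


def rank_facets(facet_counts, global_doc_freq, num_docs):
    """Weight each facet count by the idf of its term, highest first."""
    scored = {
        term: count * idf(global_doc_freq[term], num_docs)
        for term, count in facet_counts.items()
    }
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical numbers: counts for the last 24 hours, and global
# document frequencies in an index of one million articles.
recent = {"Barack Obama": 1000, "Iran": 800}
global_df = {"Barack Obama": 70000, "Iran": 1000}
ranked = rank_facets(recent, global_df, num_docs=1000000)
```

With these made-up inputs "Iran" ends up ahead of "Barack Obama" even though
its raw count is lower.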

Is my reasoning sound?  Can you provide any guidance as to the best way to
implement this?

Thanks for your help,

Asif



Re: Facets with an IDF concept

Ian Holsman (Lists)
Asif Rahman wrote:

> Hi Grant,
>
> I'll give a real life example of the problem that we are trying to solve.
>
> We index a large number of current news articles on a continuing basis.  We
> tag these articles with news topics (e.g. Barack Obama, Iran, etc.).  We
> then use these tags to facet our queries.  For example, we might issue a
> query for all articles in the last 24 hours.  The facets would then tell us
> which news topics have been written about the most in that period.  The
> problem is that "Barack Obama", for example, is always written about in high
> frequency, as opposed to "Iran" which is currently very hot in the news, but
> which has not always been the case.  In this case, we'd like to see "Iran"
> show up higher than "Barack Obama" in the facet results.
>
>  

You're not looking for an IDF-based function.
You need to figure out what a 'normal' amount of news flow for a given
topic is and then determine when an abnormal amount is happening.
Note that an abnormal amount can be positive or negative.
We use a similar method on http://love.com, so we know, for
example, that something is going on with Ed McMahon as I type.

I wouldn't be looking at using Solr to do this kind of thing, btw. Try
something like Esper. I think it might hold some promise for this kind of
thing (Esper is an open source stream database).
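
One simple way to express "normal" versus "abnormal" is a z-score against a
topic's own history. This is only a sketch of the general idea, not the
method used on love.com and not Esper code; the daily counts are invented.

```python
from statistics import mean, pstdev


def abnormality(history, current):
    """Return how many standard deviations `current` is from the
    historical mean. Positive means more coverage than normal,
    negative means less.
    """
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # A perfectly flat history: any change at all is abnormal.
        if current == mu:
            return 0.0
        return float("inf") if current > mu else float("-inf")
    return (current - mu) / sigma


# Hypothetical daily article counts for one topic.
usual_days = [100, 110, 90, 100]
spike = abnormality(usual_days, 200)   # far above normal
quiet = abnormality(usual_days, 50)    # far below normal
steady = abnormality(usual_days, 100)  # exactly normal
```

A topic that is always heavily covered gets a high mean, so its usual volume
scores near zero, while a normally quiet topic that spikes scores high.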

Regards


Re: Facets with an IDF concept

Grant Ingersoll-2
In reply to this post by Asif Rahman

On Jun 23, 2009, at 8:05 AM, Asif Rahman wrote:

> Hi Grant,
>
> I'll give a real life example of the problem that we are trying to solve.
>
> We index a large number of current news articles on a continuing basis.  We
> tag these articles with news topics (e.g. Barack Obama, Iran, etc.).  We
> then use these tags to facet our queries.  For example, we might issue a
> query for all articles in the last 24 hours.  The facets would then tell us
> which news topics have been written about the most in that period.  The
> problem is that "Barack Obama", for example, is always written about in high
> frequency, as opposed to "Iran" which is currently very hot in the news, but
> which has not always been the case.  In this case, we'd like to see "Iran"
> show up higher than "Barack Obama" in the facet results.
>
> To me, this seems identical to the tf-idf scoring expression that is used in
> normal search.  The facet count is analogous to the tf and I can access the
> facet term idf's through the Similarity API.

I'd say faceting is akin to the DF (doc freq) part of search, not TF.
TF is per document, DF is across all the docs.  Faceting is just
counting all of the docs that contain the various terms in that field
across the result set.

Regardless of the semantics, it doesn't sound like DF would give you  
what you want.  It could be entirely possible that in some short  
timespan the number of docs on Iran could match up w/ the number on  
Obama (maybe not for that particular example) in which case your "hot"  
item would no longer appear hot.

One idea is that you could take baselines of all the facets nightly  
for that field (via *:* or something) and then you could track the  
trends that way by calculating the diffs.  Of course, you could then  
do this hour to hour and get into all kinds of trend detection stuff.  
In other words, it does seem like it's something you could do with Solr.
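
A rough sketch of the baseline-and-diff idea. It assumes a facet field named
`topic` (hypothetical) and Solr's flat JSON layout for facet_fields, i.e.
[term, count, term, count, ...]; actually sending the request and storing
each night's snapshot are left out.

```python
from urllib.parse import urlencode


def baseline_query(facet_field):
    """Query string for a full-index facet snapshot via *:*."""
    return urlencode({
        "q": "*:*",
        "rows": 0,
        "facet": "true",
        "facet.field": facet_field,
        "facet.limit": -1,   # return every term, not just the top ones
        "wt": "json",
    })


def parse_flat_facets(flat):
    """Turn [term, count, term, count, ...] into {term: count}."""
    return dict(zip(flat[0::2], flat[1::2]))


def diff_baselines(previous, current):
    """Per-term change in count between two snapshots."""
    terms = set(previous) | set(current)
    return {t: current.get(t, 0) - previous.get(t, 0) for t in terms}


# Invented snapshots from two consecutive nights.
last_night = parse_flat_facets(["Obama", 7000, "Iran", 1000])
tonight = parse_flat_facets(["Obama", 7100, "Iran", 1800, "iPod", 5])
growth = diff_baselines(last_night, tonight)
```

Sorting `growth` by value would then surface the topics that gained the most
since the previous snapshot.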


Re: Facets with an IDF concept

hossman

: Regardless of the semantics, it doesn't sound like DF would give you what you
: want.  It could be entirely possible that in some short timespan the number of
: docs on Iran could match up w/ the number on Obama (maybe not for that
: particular example) in which case your "hot" item would no longer appear hot.

but if the numbers match up in that timespan then the "hot" item isn't as
"hot" anymore.

Maybe I'm misunderstanding, but it sounds like Asif's question essentially
boils down to getting facet constraints sorted after applying some
normalizing fraction ... the simplest case being the inverse ratio (this
is where I think Asif is comparing it to IDF) of the number of matches for
that facet in some larger docset to the size of the docset -- typically
that docset could be the entire index, but it could also be the same
search over a larger window of time.

So if I was doing a news search for all docs in the last 24 hours, I could
multiply each of those facet counts by the ratio of the number of articles
from the past month to the corresponding facet counts from the past month to
see how much "hotter" they are in my smaller result set...

current result set facet counts (X)...
  News:1100
  Obama:1000
  Iran:800
  Miley Cyrus:700
  iPod:500

facet counts from the past month (Y), during which time 9000 (Z)
documents were published...
  News:9000
  Obama:7000
  Iran:1000
  Miley Cyrus:4000
  iPod:5000

X*(Z/Y)...
  Iran:7200
  Miley Cyrus:1575
  Obama:1285.7
  News:1100
  iPod:900
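
The arithmetic above can be reproduced with a few lines of Python. This is
only a client-side illustration of X*(Z/Y) using the numbers from this
message; the function name is made up, and it is not the Solr plugin itself.

```python
def normalize(current, baseline, baseline_total):
    """Score each facet as X * (Z / Y) and sort, hottest first.

    current        -- facet counts in the small result set (X)
    baseline       -- facet counts in the larger docset (Y)
    baseline_total -- number of documents in the larger docset (Z)
    """
    scored = {
        term: x * baseline_total / baseline[term]
        for term, x in current.items()
        if baseline.get(term)  # skip terms with no baseline count
    }
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)


X = {"News": 1100, "Obama": 1000, "Iran": 800, "Miley Cyrus": 700, "iPod": 500}
Y = {"News": 9000, "Obama": 7000, "Iran": 1000, "Miley Cyrus": 4000, "iPod": 5000}
Z = 9000

ranked = normalize(X, Y, Z)
for term, score in ranked:
    print("%s:%.1f" % (term, score))
```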
 

Doing this in a Solr plugin would be the best way to do this -- because
otherwise your "hot" terms might not even show up in the facet lists.
Any attempt to do it on the client would just be an approximation, and
could easily miss the "hottest" item if it was just below the cutoff for the
number of constraints to be returned.


-Hoss


Re: Facets with an IDF concept

Otis Gospodnetic-2
In reply to this post by Asif Rahman

Hi,

Hm, I don't think facets (nor pure search/Solr) are the right tool for this job.  I think you have to do what Ian said, which is to compute the baseline for various concepts of interest (Barack Obama and Iran in your example), and then compare.

Look at point #2 on http://www.sematext.com/product-key-phrase-extractor.html .  I think this is what you are after, and you will even see an example that matches yours very closely.  My guess is that's how http://www.google.com/trends/hottrends works, too.

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch




Re: Facets with an IDF concept

Grant Ingersoll-2
In reply to this post by hossman

On Jun 23, 2009, at 6:23 PM, Chris Hostetter wrote:

>
> : Regardless of the semantics, it doesn't sound like DF would give you what you
> : want.  It could be entirely possible that in some short timespan the number of
> : docs on Iran could match up w/ the number on Obama (maybe not for that
> : particular example) in which case your "hot" item would no longer appear hot.
>
> but if the numbers match up in that timespan then the "hot" item isn't as
> "hot" anymore.

Not necessarily true.  Consider the case where over the year there are  
50 stories about Obama.  Then, in the span of 5 days, there are 50  
stories about Iran.  Iran, in my view, is still hotter than Obama.  In  
Asif's case, he was suggesting comparing against the global DF.
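
To put rough numbers on that example, here is the same comparison as a rate
per day. It assumes "over the year" means 365 days; the function name is just
for illustration.

```python
def stories_per_day(count, days):
    """Average number of stories per day over a window."""
    return count / days


obama_rate = stories_per_day(50, 365)  # 50 stories spread over a year
iran_rate = stories_per_day(50, 5)     # 50 stories in 5 days
ratio = iran_rate / obama_rate         # how much "hotter" Iran is
```

Equal totals hide a rate difference of roughly 73x, which is why the window
matters more than the raw count.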

Not to worry, though, your proposal is much the same as mine, namely  
take a baseline based on some set of docs (I chose *:*, you chose past  
month) and then compare.


Re: Facets with an IDF concept

wojtekpia
In reply to this post by Asif Rahman
Hi Asif,

Did you end up implementing this as a custom sort order for facets? I'm facing a similar problem, but not related to time. Given 2 terms:
A: appears twice in half the search results
B: appears once in every search result
I think term A is more "interesting". Using facets sorted by frequency, term B is more important (since it shows up first). To me, terms that appear in all documents aren't really that interesting. I'm thinking of using a combination of document count (in the result set, not globally) and term frequency (in the result set, not globally) to come up with a facet sort order.
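
One possible way to combine the two numbers, as a sketch only: the formula
below is an assumption made up for illustration (average occurrences per
matching document, times a log penalty for terms that match most of the
result set), and getting term frequencies out of Solr is not covered.

```python
import math


def interest(doc_count, term_freq, num_results):
    """Hypothetical facet score within a result set.

    doc_count   -- results containing the term
    term_freq   -- total occurrences of the term across those results
    num_results -- size of the whole result set
    """
    density = term_freq / doc_count                   # occurrences per matching doc
    rarity = math.log(1.0 + num_results / doc_count)  # lower when the term is everywhere
    return density * rarity


# 100 search results.
# A: appears twice in half the results (50 docs, 100 occurrences).
# B: appears once in every result (100 docs, 100 occurrences).
score_a = interest(doc_count=50, term_freq=100, num_results=100)
score_b = interest(doc_count=100, term_freq=100, num_results=100)
```

With this weighting term A scores higher than term B, which matches the
intuition above; other combinations would work too.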

Wojtek

Re: Facets with an IDF concept

Asif Rahman
Hi Wojtek:

Sorry for the late, late reply.  I haven't implemented this yet, but it is
on the (long) list of my todos.  Have you made any progress?

Asif


Re: Facets with an IDF concept

Lance Norskog-2
In Solr a facet is assigned one number: the number of documents in
which it appears. The facets are sorted by that number.  Would your
use case be solved with a second number that is formulated from the
relevance of the associated documents? For example:

   facet relevance = count * sum(scores of documents) with
coefficients for each input?

To do this, for each document counted by the facet, you then have to
find that document in the result list and pull the score. This would
be much slower than the current "count the documents" algorithm. But
if you have limited the document list via filter, this could still be
fast enough for interactive use.

If I wanted to make a tag cloud, this is how I would do it.
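
A small sketch of that formula over an in-memory result list. It is not Solr
code: the (score, terms) pairs stand in for scored documents, and the two
exponents are just one possible reading of "coefficients for each input".

```python
from collections import defaultdict


def facet_relevance(results, count_power=1.0, score_power=1.0):
    """Compute count * sum(scores) per facet term.

    results -- iterable of (score, terms) pairs, one per matching document
    count_power, score_power -- hypothetical tuning exponents
    """
    counts = defaultdict(int)
    score_sums = defaultdict(float)
    for score, terms in results:
        for term in set(terms):  # count each document once per term
            counts[term] += 1
            score_sums[term] += score
    return {
        term: (counts[term] ** count_power) * (score_sums[term] ** score_power)
        for term in counts
    }


# Invented result list: (relevance score, facet terms of the document).
docs = [
    (2.0, ["iran", "news"]),
    (1.0, ["news"]),
    (0.5, ["obama", "news"]),
]
relevance = facet_relevance(docs)
```

This walks every matching document, which is the extra cost mentioned above,
so it only makes sense on a result list that is already limited.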

--
Lance Norskog
[hidden email]