Revisiting IDF Problems and Index Slices

3 messages

Revisiting IDF Problems and Index Slices

mbennett
I'm investigating a problem I bet some of you have hit before, and exploring
several options to address it.  I suspect that this specific IDF scenario is
common enough that it even has a name, though I'm not sure what it would be
called.

The scenario:

Suppose you have a search application focused on the products that you sell:
* You initially launch with just one product area for your "widget".
* You add a stop words list for really common words, say "a", "the", "and",
etc.

Looking at the search logs you discover:
* Other common English words being used, such as "i", "how" and "can".
These are automatically getting low IDF scores, since they are also common
in your documents, so normally you wouldn't worry.
* But you also notice that the word "widget" is getting a very low score,
since it is mentioned in virtually every document.  In fact, since widget is
more common than the word "can", it is actually getting a *lower* IDF score
than the word "can", giving some very unsatisfying search results.

Some queries include other "good" words like "replace" and "battery", so
those searches are doing OK; those terms are relatively rare, so they
properly get the most weight.

But some searches have nothing but "bad" words, for example "how can I get a
widget ?"
* The word "a" is stopped out.
* The words "how", "can", "i" and "get" are common English words, so they
get a low score.
* And the word "widget", your flagship product, is also getting a low score,
since it's in every document.
So some queries are ENTIRELY composed of "low quality" words, and the search
engine gives somewhat random results.

Some ideas to address this:

Idea 1: Do nothing to IDF and just deploy the content for OTHER products
ASAP; the IDF logic will automatically fix the problem.

With content only about widgets, 100% of the docs contain that word, so it
gets the minimum IDF score.

But if you deploy content for 9 more products, giving a total of 10
products, then only 1/10th of them will tend to have the word "widget".  But
common English words will continue, on average, to appear in almost all
documents, across all product areas.  Therefore they will continue to get
low scores.  So relative to the words "i" and "can", the word "widget" will
automatically rise up out of the IDF muck.

And when content for all 100 of your products is deployed, "widget" will
have been boosted even further, compared to other common English words.

Though long term I have faith in this strategy, suppose I only had content
ready for two other products, say "cogs" and "hookas"; then the term
"widget" will get a bit of a boost, but not by as much as I'd like.
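To make Idea 1 concrete, here's a rough sketch. The doc counts are made up, and the formula is a classic Lucene-style IDF, not necessarily exactly what your engine computes:

```python
import math

def idf(num_docs, doc_freq):
    # Classic Lucene-style IDF: the rarer the term, the higher the score.
    return 1.0 + math.log(num_docs / (doc_freq + 1.0))

# One product, 100 docs: "widget" in all 100, "can" in 70.
# "widget" actually scores *below* "can".
print(idf(100, 100) < idf(100, 70))     # True

# Ten products, 1000 docs: "widget" still in ~100, but "can" in ~700.
# Now "widget" rises well above "can", with no manual intervention.
print(idf(1000, 100) > idf(1000, 700))  # True
```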

Idea 2: Compare terms in product "slices" and artificially boost them.

In the Widget content there are references to related accessories like the
wBracket and wCase.

Whereas in the Hooka content, I rarely see mention of Widgets or wCases, but
I do still see the English terms "i" and "can", and lots of stuff about
Hookas.

2a: So I do a histogram for each of the 3 products, and look for terms that
are common in one but not in the others.

2b: Or I do a histogram of all terms, and then a histogram of terms for each
of the 3 product slices.  And I compare those lists.

Of course we don't want to go overboard.  If somebody is in the Widget forum
and asks "how do I change the widget battery?" then, although the term
Widget is more important than "how" or "do", it is NOT as important as
"battery" or "change", so I would not want to boost "widget" too much.

BTW, Solr facets make this term info relatively easy to gather, thanks for
the pointer Yonik!  Solr 1.4 also speeds up faceting.
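A minimal sketch of 2a, assuming you've already pulled per-slice document-frequency histograms (e.g. via facets); the counts, the equal slice sizes, and the 3x threshold are all made-up assumptions:

```python
from collections import Counter

# Toy document-frequency histograms per product slice (assumed data,
# equal-sized slices).
slices = {
    "widget": Counter({"i": 90, "can": 70, "widget": 95, "wbracket": 40}),
    "cog":    Counter({"i": 88, "can": 72, "cog": 93, "widget": 5}),
    "hooka":  Counter({"i": 91, "can": 69, "hooka": 94}),
}

def distinctive_terms(slices, name, ratio=3.0):
    """Idea 2a: terms common in one slice but rare in all the others."""
    here = slices[name]
    others = [h for n, h in slices.items() if n != name]
    out = []
    for term, count in here.items():
        max_elsewhere = max((h.get(term, 0) for h in others), default=0)
        if count >= ratio * max(max_elsewhere, 1):
            out.append(term)
    return sorted(out)

# "i" and "can" are common everywhere, so they don't qualify.
print(distinctive_terms(slices, "widget"))  # ['wbracket', 'widget']
```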

Idea 3: (possibly in conjunction with 2) Compare terms by product slice, and
do "complementary" IDF boosting.

In other words, it's very common for content on the Widget forum to have the
terms "widget", "wBracket" and "wcase".

But if a document in the "Cog" forum mentions "widget", that's actually
rather interesting.  Mentions of "widgets" in the Cog and Hooka content
should be treated as "important", regardless of how common "widget" is in
the Widget forum.

So we compare the common terms in the Widget, Hooka and Cog areas.  Terms
that are specific to Widgets get a bigger boost in the Cog and Hooka areas.
Cog terms get bigger boosts in Hooka and Widget areas, and Hooka terms get
extra credit in Cogs and Widgets.

This extra boosting could be done in addition to the minor boosting widget
terms get in the Widget area, just to keep them above the common English
terms.
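The cross-boosting in Idea 3 could be tabulated like this. The boost multipliers (1.2 at home, 2.0 elsewhere) are placeholder values, and `slice_terms` would come from the histogram comparison above:

```python
def complementary_boosts(slice_terms, home_boost=1.2, cross_boost=2.0):
    """Idea 3: a term distinctive to slice A gets a big boost when it shows
    up in any *other* slice, and only a small boost in its home slice."""
    boosts = {}  # (area, term) -> boost multiplier
    for home, terms in slice_terms.items():
        for term in terms:
            for area in slice_terms:
                boosts[(area, term)] = home_boost if area == home else cross_boost
    return boosts

slice_terms = {"widget": {"widget", "wbracket", "wcase"},
               "cog": {"cog"},
               "hooka": {"hooka"}}
b = complementary_boosts(slice_terms)
# "widget" mentioned in the Cog area is interesting; at home it's routine.
print(b[("cog", "widget")], b[("widget", "widget")])  # 2.0 1.2
```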

Implementation of ideas 2 and 3:

This "diffing" of IDF by product slices would seem to be a good idea, though
it would take a bit of coding to implement.  I'd think somebody has already
looked into this more formally?

In terms of implementation, you could do an arbitrary cutoff at 100 terms
for each product area, and then just use a Boolean test for whether a term
appears in the other lists or not.  But after running some tests, common
terms are likely to be mentioned in many lists, and it's more a question of
where on the list they were.  For example, if you can interface Cogs and
Widgets, then Widget content will occasionally mention Cogs, and vice versa.

To me this seems like more of a Bayesian or Contingency Table issue, with
some weight given to the actual number of occurrences or the relative
position on lists.  Any of you Brainiacs have specific insights?

There's also a question of scaling the Histogram.  If I compare
term-in-document counts in the Widget area to the entire corpus, then since
there are fewer documents in the Widget area (it's a subset of the total),
by definition the number of documents with the word "i" in that area
will be lower.  You'd think the relative order wouldn't be affected, on
average, but any calculation using the occurrence counts would need to take
that into account.  In other words, if the word "can" appears more times in
the overall documents than it does in the Widget documents, is that just
because there are more documents in the overall set, or because "can" is
really more dense?  Simple scaling can probably handle this issue...

Even if I compare Widget to Cog documents, both of which are subsets of the
total, there's likely to be a different number of documents in each area,
and therefore the counts would be different.  So I'll get differences in
"how" and "can", in addition to "widget" and "cog", just because the sample
sizes are different.  Again, scaling or limiting the sample size might fix
that too.  Presumably the ranking of the words "i" and "can", which one is
more popular than the other, would stay the same regardless of sample size
and absolute counts, if I were looking at random English text.  If those
terms change position, then that would be more interesting.
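The "simple scaling" could be as simple as comparing document-frequency *rates* rather than raw counts; the numbers and the 1.5x "interesting" factor here are assumptions for illustration:

```python
def density(doc_freq, num_docs):
    """Fraction of documents containing the term; scales away sample size."""
    return doc_freq / num_docs

def denser_in_slice(df_slice, n_slice, df_all, n_all, factor=1.5):
    """Is the term meaningfully denser in the slice than overall?"""
    return density(df_slice, n_slice) > factor * density(df_all, n_all)

# "can": 70 of 100 Widget docs vs 700 of 1000 overall.
# Same density; the raw count only differs because of sample size.
print(denser_in_slice(70, 100, 700, 1000))   # False

# "widget": 95 of 100 Widget docs vs 110 of 1000 overall -- much denser.
print(denser_in_slice(95, 100, 110, 1000))   # True
```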

However, I think randomness also plays a role here.  If I do a histogram of
100 Widget documents and only 5 Cog documents, and I notice that the
relative positions of "i" and "can" have changed places, I might suspect
it's just a random fluctuation, since I only had 5 Cog documents.  If I
added 50 more, would the difference still be there?

Sample size is certainly a well-established topic in statistics, though I
haven't seen it applied to histograms.

And another odd thing about histograms taken from different sample sizes is
the low frequency terms.  In all search engine histograms you'll see terms
that appear only 1 time, and there'll be a bunch of them.  But presumably if
I doubled the number of documents in the sample, some of those terms might
then appear twice, whereas others would still only appear once.  There are
other breaks in histogram counts, like when it goes from 2 to 1, that would
presumably also shift around a bit.  But I'm not sure you can reliably use
sample size ratios to scale these things with confidence.  This feels like
Poisson distribution territory...
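One textbook way to ask "is this rank swap just noise?" is a two-proportion z-test on the document rates. A rough stdlib-only sketch, with made-up counts matching the 100-vs-5 example:

```python
import math

def two_proportion_z(df1, n1, df2, n2):
    """Rough z-score for whether a term's document rate really differs
    between two slices, or is plausibly just small-sample noise."""
    p1, p2 = df1 / n1, df2 / n2
    pooled = (df1 + df2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se if se else 0.0

# "i" in 60 of 100 Widget docs vs 2 of 5 Cog docs.
# |z| is well under 1.96, so with only 5 Cog documents the apparent
# difference could easily be a fluke.
print(abs(two_proportion_z(60, 100, 2, 5)) < 1.96)  # True
```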

I suspect all of these questions have also been examined in detail before; I
need to dig up my copy of that "Long Tail" book...  Wondering if anybody has
specific links or verbiage / author names to suggest?

There are certainly other ideas for indirectly addressing the IDF problem
by using other methods to compensate:

4: You could overlay user reviews / popularity or click-through rates or
other social factors

5: Use flexible proximity searches, in addition to phrase matching, to boost
certain matches.  Verity used to call this the "near" operator, Lucene/Solr
talk about span queries and "phrase slop".

So in a question containing "how can I", documents with that exact phrase
get a substantial boost.  But documents where "how", "can" and "I" appear
within 5 words of each other, in any order, also get a small boost.  It's
been suggested that this can help with Q&A type applications.
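The unordered-within-5-words check (Verity NEAR, roughly what a Lucene SpanNearQuery with slop and inOrder=false matches) could be approximated by a toy checker like this; the window size and the doc are just illustrative:

```python
def within_window(tokens, terms, window=5):
    """True if all query terms co-occur within `window` tokens, any order."""
    positions = [i for i, t in enumerate(tokens) if t in terms]
    for i, start in enumerate(positions):
        seen = set()
        for pos in positions[i:]:
            if pos - start >= window:
                break  # past the window; try the next starting position
            seen.add(tokens[pos])
        if seen == set(terms):
            return True
    return False

doc = "tell me how i can get one".split()
print(within_window(doc, {"how", "can", "i"}))   # True: the terms cluster
```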

6: FAST's doc mentions that verbs can be more important in question and
answer type applications.  Normally nouns are said to contain 60-70% of the
relevance of a query (reference?), but their linguistics guide suggests
verbs are more important in Q and A apps.

7: Another idea would be to boost word pairs using composite N-Gram tokens,
what Lucene and Solr call "shingles".

Using my previous examples:
Question: "How can I get a widget ?"
    Normal index: "how", "can", "i", "get", "widget"
    Shingles: "how_can", "can_i", "i_get", "get_widget"
Question: "How do I change the widget battery?"
    Normal index: "how", "do", "i", "change", "widget", "battery"
    Shingles: "how_do", "do_i", "i_change", "change_widget",
"widget_battery"

And both sets of tokens would be added to the index, in different fields,
and with possibly different boosts.

The idea being that tokens like "can_i" might be somewhat more statistically
significant than "can" and "i" by themselves.  So at least in the first
question, which has all low quality words, the "can_i" might help a bit.
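The shingle tokenization above can be sketched in a couple of lines (a toy stand-in for Lucene/Solr's ShingleFilter with shingle size 2):

```python
def shingles(tokens, sep="_"):
    """Join each adjacent pair of tokens into a composite term."""
    return [sep.join(pair) for pair in zip(tokens, tokens[1:])]

q = ["how", "can", "i", "get", "widget"]   # "a" already stopped out
print(shingles(q))   # ['how_can', 'can_i', 'i_get', 'get_widget']
```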

I'd appreciate any comments on these ideas from y'all, or perhaps names of
specific algorithms / authors.

--
Mark Bennett / New Idea Engineering, Inc. / [hidden email]
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513

Re: Revisiting IDF Problems and Index Slices

Otis Gospodnetic-2
As soon as I started reading your message I started thinking "common
grams", so that is what I would try first, esp. since somebody already
did the work of porting that from Nutch to Solr (see Solr JIRA).
 Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



----- Original Message ----

> From: Mark Bennett <[hidden email]>
> To: [hidden email]
> Sent: Thursday, August 6, 2009 2:27:52 PM
> Subject: Revisiting IDF Problems and Index Slices
>
> [...]


Re: Revisiting IDF Problems and Index Slices

Grant Ingersoll-2
In reply to this post by mbennett
Some thoughts below, sorry for the late reply...

On Aug 6, 2009, at 2:27 PM, Mark Bennett wrote:

> [...]

Other ideas:

- Doc Analysis: Boost documents that have Widget in them.
- Query Analysis: Boost the Widget query term by using some intelligent
query analysis to detect when certain keywords are used.
- Payloads: If you know Widget is important ahead of time (through other
means), give it a payload and use the likes of the
BoostingFunctionTermQuery instead of a regular TermQuery.
- Query Elevation Component: sometimes it may just make sense to make some
results be the top results regardless of the scores.
- While it is good to keep stopwords in the index, it isn't always good to
keep them in the query.  Again, query analysis can be employed to determine
what words to keep and what to get rid of.
- Override Similarity to downplay the importance of IDF; this is likely a
rather hard thing to get right.


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search