n-gram over-representation?

n-gram over-representation?

Drew Farris
I have a collection of about 800k bigrams from a corpus of 3.7m
documents that I'm in the process of working with. I'm looking to
determine an appropriate subset of these to use as features for both
an ML and an IR application. Specifically, I'm considering
white-listing a subset of these to use as features when building a
classifier, and separately as terms when building an index and doing
query parsing. As part of the earlier collocation discussion, Ted
mentioned that tests for over-representation could be used to identify
dubious members of such a set.

Does anyone have any pointers to discussions of how such a test could
be implemented?

Thanks,

Drew

Re: n-gram over-representation?

kkrugler

On Feb 16, 2010, at 8:28am, Drew Farris wrote:

> Does anyone have any pointers to discussions of how such a test could
> be implemented?

Wouldn't simple df (document frequency) be a reasonable metric for this?

From what I've seen in Lucene indexes, a ranked list of terms (by df)
has a pretty sharp elbow that you could use as the cut-off point.
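For concreteness, here is a rough sketch of that cut-off heuristic. The df values below are synthetic, and the elbow is found as the ranked point farthest from the chord between the curve's endpoints, which is one common knee-detection trick; other heuristics would work just as well:

```python
import math

def elbow_cutoff(dfs):
    """Return (elbow rank index, ranked dfs) for a descending ranked-df
    curve, using the max-distance-to-chord knee heuristic."""
    ranked = sorted(dfs, reverse=True)
    n = len(ranked)
    # Chord from the first to the last point of the curve (x = rank, y = df).
    x1, y1, x2, y2 = 0, ranked[0], n - 1, ranked[-1]
    denom = math.hypot(x2 - x1, y2 - y1)
    best_i, best_d = 0, -1.0
    for i, y in enumerate(ranked):
        # Perpendicular distance from (i, y) to the chord.
        d = abs((y2 - y1) * i - (x2 - x1) * y + x2 * y1 - y2 * x1) / denom
        if d > best_d:
            best_i, best_d = i, d
    return best_i, ranked

# Synthetic Zipf-ish dfs: a few very common terms, then a long flat tail.
dfs = [100000, 50000, 20000, 5000, 800, 200, 120, 100, 90, 80, 70, 60]
cut, ranked = elbow_cutoff(dfs)
```

Terms ranked above the elbow index would be cut as "too common"; everything below it survives.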

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g





Re: n-gram over-representation?

Jason Rennie-2
In reply to this post by Drew Farris
As Ken noted, DF is a reasonable metric for term selection.  If you're
interested in additional discussion and/or a more sophisticated
approach, you might be interested in a paper I wrote on the topic of
identifying "informative" terms:

http://people.csail.mit.edu/jrennie/papers/sigir05-informativeness.pdf

Cheers,

Jason




--
Jason Rennie
Research Scientist, ITA Software
617-714-2645
http://www.itasoftware.com/

Re: n-gram over-representation?

Jake Mannix
In reply to this post by Drew Farris
Drew,

  Did you pick your whitelist using the LLR score?  What is the kind of
over-representation you're trying to prune out?  DF will certainly help you
remove "too common" bigrams, but that's not what you're looking for, is it?

  -jake


Re: n-gram over-representation?

Drew Farris
Hi Jake,

Yes, I'm using the LLR score. I was wondering if there is anything
else I should be looking at other than LLR and min/max DF. The corpus
is large and the list is too big to review by hand, so I'm wondering
if there's any sort of additional measure I can use to suggest whether
I should consider stopping additional subgrams or something of that
nature.

Ideally, this would be something that could be rolled back into the
existing collocation identifier in Mahout.

Thanks,

Drew

(Thanks also Ken, Jason for the comments and pointers -- DF is highly
effective indeed.)



Re: n-gram over-representation?

Ted Dunning
I think that as far as pure corpus analysis is concerned, LLR, min/max
DF and tf-idf are about as good as you will get.  Tf-idf is, in fact,
an approximation of LLR, so I don't even think you need to use that
(and it is document-centered rather than corpus-centered in any case).
You might get some mileage out of looking for terms that have highly
variable LLR across different documents.

To get a substantial improvement over these measures, I would recommend
adding new data to the mix.  The new data I would look at first is some sort
of user behavior history.  Do you have anything like that?

On Tue, Feb 16, 2010 at 10:22 AM, Drew Farris <[hidden email]> wrote:

> Yes, I'm using the LLR score. I was wondering if there is anything
> else I should be looking at other than LLR and min/max DF.
>



--
Ted Dunning, CTO
DeepDyve

Re: n-gram over-representation?

Jake Mannix
In reply to this post by Drew Farris
So since you're building both a classifier and a search index, I'm
guessing that to train your classifier you have at least some example
docs to train on, right?  If you have an n-way classifier in which one
of the classes is "other/unclassified", then you could look for ngrams
which are overrepresented in the union of the classes which aren't
"other" (i.e., these ngrams are representative of some useful class).
These ngrams could form your whitelist.
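A sketch of how such a whitelist could be computed, assuming per-class ngram counts are available. The counts, totals, and threshold below are all made up for illustration; `llr` is Dunning's G^2 over a 2x2 contingency table:

```python
import math

def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio (G^2) for a 2x2 contingency table."""
    def h(*ks):
        n = sum(ks)
        return sum(k * math.log(k / n) for k in ks if k > 0)
    return 2.0 * (h(k11, k12, k21, k22)
                  - h(k11 + k12, k21 + k22)
                  - h(k11 + k21, k12 + k22))

# counts[ngram] = (occurrences in the "useful" classes, occurrences in "other")
counts = {
    "log likelihood": (120, 3),   # skewed toward the useful classes
    "click here":     (40, 45),   # roughly uniform -> not informative
}
total_useful = 5000   # total ngram occurrences in the useful classes
total_other = 5000    # total ngram occurrences in "other"

def whitelist(counts, threshold=10.0):
    keep = []
    for ngram, (k_u, k_o) in counts.items():
        score = llr(k_u, k_o, total_useful - k_u, total_other - k_o)
        # Keep only ngrams whose rate leans toward the useful classes:
        # a high G^2 alone also flags "other"-heavy ngrams.
        if score > threshold and k_u / total_useful > k_o / total_other:
            keep.append(ngram)
    return keep
```

The rate check matters because G^2 is symmetric: it flags any association, including ngrams overrepresented in "other".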

  -jake


Re: n-gram over-representation?

Drew Farris
In reply to this post by Ted Dunning
On Tue, Feb 16, 2010 at 1:38 PM, Ted Dunning <[hidden email]> wrote:
>
> To get a substantial improvement over these measures, I would recommend
> adding new data to the mix.  The new data I would look at first is some sort
> of user behavior history.  Do you have anything like that?

I don't have any behavioral history, but this corpus contains
documents that were generated over a span of decades, so perhaps it is
valid to partition documents by time somehow. Identifying variable LLR
across documents seems pretty interesting too.

I was also wondering whether comparing the ngrams found in this corpus
against a general corpus could be a worthwhile endeavor. Some quick
and dirty work suggests that the overlap in n-grams between this
domain-specific corpus and a general one is pretty low. I have some
follow-up work I need to do there to be certain. The general corpora I
have in hand include Wikipedia and a large set of documents collected
from the web. I have the sneaking suspicion that these may not be
general enough compared to those used for other statistical work of
this ilk (e.g. the corpus used in the IBM MT work).

Drew

Re: n-gram over-representation?

Ted Dunning
This comparison is very interesting whether made against a general
corpus or against a specific sub-corpus already in your data.

You will often find that an n-gram is in one corpus and not in
another, but the question becomes how much this happens (i.e. does LLR
say that this happens enough to be interesting).  The max over the
scores of many comparisons then becomes the interesting number.
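One way to sketch this max-over-comparisons idea (all counts below are hypothetical; each comparison is a G^2 over a 2x2 table of this-bigram vs. all-other-bigrams in the domain corpus vs. one reference corpus):

```python
import math

def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio (G^2) for a 2x2 contingency table."""
    def h(*ks):
        n = sum(ks)
        return sum(k * math.log(k / n) for k in ks if k > 0)
    return 2.0 * (h(k11, k12, k21, k22)
                  - h(k11 + k12, k21 + k22)
                  - h(k11 + k21, k12 + k22))

def corpus_llr(count_a, total_a, count_b, total_b):
    """G^2 for one bigram's rate in corpus A vs. corpus B."""
    return llr(count_a, count_b, total_a - count_a, total_b - count_b)

# Hypothetical: one bigram's count in the domain corpus vs. two
# reference corpora, each given as (bigram count, total bigram tokens).
domain = (150, 1_000_000)
references = [(2, 5_000_000), (0, 3_000_000)]

scores = [corpus_llr(domain[0], domain[1], c, n) for c, n in references]
max_score = max(scores)  # the "interesting number" across comparisons
```

A bigram common in the domain corpus but essentially absent from every reference corpus ends up with a large max score.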

On Tue, Feb 16, 2010 at 11:01 AM, Drew Farris <[hidden email]> wrote:

> I also was wondering if comparing the ngrams found in this corpus
> against a general corpus could be a worthwhile endeavor? Some quick
> and dirty work suggests that the overlap in n-grams between this
> domain-specific corpus and a general one is pretty low.
>



--
Ted Dunning, CTO
DeepDyve

Re: n-gram over-representation?

Jason Rennie-2
In reply to this post by Ted Dunning
On Tue, Feb 16, 2010 at 1:38 PM, Ted Dunning <[hidden email]> wrote:

> I think that as far as pure corpus analysis is concerned, LLR, min/max DF
> and tf-idf are about as good as you will get.  TF-idf is, in fact, an
> approximation of LLR, so I don't even think you need to use that (and it is
> document centered rather than corpus centric in any case).  You might get
> some mileage out of looking for terms that have highly variable LLR in
> different documents.
>

Am I incorrect in thinking that the events used for LLR here are the
occurrences of the individual terms in a bigram?  I'm looking here:

http://svn.apache.org/viewvc/lucene/mahout/trunk/math/src/main/java/org/apache/mahout/math/stats/LogLikelihood.java?view=markup

I don't follow the argument that tf-idf is an approximation of LLR.  Are you
referring to the Papineni paper?

FWIW, I've found Residual IDF to be more effective than IDF at selecting
words.  Another useful approach is to look for bigrams which have a "peaked"
distribution; that is, considering their document frequency, they have
unusually large within-document counts.
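A sketch of residual IDF as I understand it from the literature: observed IDF minus the IDF you would expect if the term's occurrences were scattered across documents by a Poisson process, so a high value flags exactly the "peaked" terms described above. All counts here are hypothetical:

```python
import math

def residual_idf(df, cf, n_docs):
    """Residual IDF = observed IDF - IDF expected under a Poisson model
    with rate cf/n_docs.  High RIDF means occurrences are concentrated
    in fewer documents than chance predicts ('peaked' terms)."""
    idf = -math.log2(df / n_docs)
    lam = cf / n_docs                        # mean occurrences per document
    expected_df_rate = 1.0 - math.exp(-lam)  # P(doc has >= 1 occurrence)
    expected_idf = -math.log2(expected_df_rate)
    return idf - expected_idf

n_docs = 100_000
# A 'peaked' term: 1000 occurrences packed into only 100 documents.
peaked = residual_idf(df=100, cf=1000, n_docs=n_docs)
# A diffuse term: 1000 occurrences spread over ~1000 documents.
diffuse = residual_idf(df=990, cf=1000, n_docs=n_docs)
```

Both terms have identical collection frequency, so plain IDF weighted by cf cannot separate them; residual IDF can.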

Jason

--
Jason Rennie
Research Scientist, ITA Software
617-714-2645
http://www.itasoftware.com/

Re: n-gram over-representation?

Drew Farris
In reply to this post by Jake Mannix
The classifier I happen to be working with is entirely supervised --
the documents in the corpus are assigned categories based on
structured document data, and we extract features from the text to do
the training. The whitelist identifies which n-grams should be used as
features.

I suspect something similar to what you described can be done here by
looking at the representation of n-grams inside a class vs. outside
it, but I need to dig deeper into the classifier mechanics to see
whether that would lead to some sort of overfitting. Thanks for the
suggestion.

Drew


Re: n-gram over-representation?

Ted Dunning
In reply to this post by Jason Rennie-2
On Tue, Feb 16, 2010 at 11:13 AM, Jason Rennie <[hidden email]> wrote:

> Am I incorrect in thinking that the events used for LLR here are the
> occurrences of the individual terms in a bigram?  I'm looking here:
>
>
> http://svn.apache.org/viewvc/lucene/mahout/trunk/math/src/main/java/org/apache/mahout/math/stats/LogLikelihood.java?view=markup
>

Here is my take on the matter:
http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html

The events are occurrences of word A (and complementarily, any non-A word)
in the first position and word B (and non-B words) in the second position.


> I don't follow the argument that tf-idf is an approximation of LLR.  Are
> you
> referring to the Papineni paper?
>

No.  I was referring to my own napkin scribblings.  If you expand the
LLR score that uses events of word A vs. not-A against
in-this-document vs. in-other-documents, you find
count(A in this document) * log(count of A in other documents) as one
of the dominant terms in the expression.  This is nearly identical to
tf*log(idf) in terms of the sort order imposed on terms.


--
Ted Dunning, CTO
DeepDyve

Re: n-gram over-representation?

Grant Ingersoll-2

On Feb 16, 2010, at 3:18 PM, Ted Dunning wrote:

> Here is my take on the matter:
> http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
>
> The events are occurrences of word A (and complementarily, any non-A word)
> in the first position and word B (and non-B words) in the second position.

Jason, the Javadocs in the file you mentioned have more or less plagiarized Ted's most excellent blog post, so hopefully they explain what you need, though there may still be room for more clarification.