# CardinalityException in DirichletDriver

31 messages

## Re: CardinalityException in DirichletDriver

On Wed, Jan 13, 2010 at 11:53 AM, Bogdan Vatkov <[hidden email]> wrote:

> Sorry, what does that mean :)?

It means that there is probably a programming bug somehow. At the very least, the program is not robust with respect to strange invocations.

> what is a dotted vector? and why aren't they the same?

Dot product is a vector operation that is the sum of products of corresponding elements of the two vectors being operated on. If these vectors don't have the same length, then it is an error.

> what should I investigate?

I am not familiar with the code, but if I had time to look, my strategy would be to start in the NormalModel and work back up the stack trace to find out how the vectors came to be different lengths. No doubt, the code in NormalModel will not tell you anything, but you can see which vectors are involved and by walking up the stack you may be able to see where they come from.

> I am basically running my complete kmeans scenario (same input data, same number of clusters param, etc.) but just replacing KmeansDriver.main step with a DirichletDriver.main call...of course the arguments are adjusted since kmeans and dirichlet do not have the same arguments.

I would think that this sounds very plausible.

> I am not sure what number I should give for the alpha argument,

Alpha should have a value in the range from 0.01 to 20. I would scan with 1, 2, 5 magnitude steps to see what works well for your data (i.e. 0.01, 0.02, 0.05, 0.1, 0.2 ... 20). A value of 1 is a fine place to start. The effect of different values should be small over a pretty wide range.

> iterations and reductions...here is my current argument set:
>
>     args = new String[] {
>         "--input",
>         "/store/dev/inst/mahout-0.2/email-clustering/1-solr-vectors/solr_index.vec",
>         "--output", config.getClustersDir(),
>         "--modelClass",
>         "org.apache.mahout.clustering.dirichlet.models.NormalModelDistribution",
>         "--maxIter", "15",
>         "--alpha", "1.0",
>         "--k", config.getClustersCount(),
>         "--maxRed", "2"
>     };

Not off-hand.
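
The 1, 2, 5 scan of alpha described above can be sketched in plain Java. This is only an illustration of the value grid; the class name `AlphaGrid` and the method `alphaGrid` are hypothetical and not part of Mahout, and the exponent range is chosen here to cover 0.01 to 20.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Mahout: builds a 1-2-5 grid of alpha values.
class AlphaGrid {

    // Returns m * 10^e for m in {1, 2, 5} and e in [minExponent, maxExponent],
    // keeping only values that do not exceed max.
    static List<Double> alphaGrid(int minExponent, int maxExponent, double max) {
        int[] mantissas = {1, 2, 5};
        List<Double> grid = new ArrayList<>();
        for (int e = minExponent; e <= maxExponent; e++) {
            for (int m : mantissas) {
                // BigDecimal avoids accumulating floating-point error across decades.
                double alpha = BigDecimal.valueOf(m).scaleByPowerOfTen(e).doubleValue();
                if (alpha <= max) {
                    grid.add(alpha);
                }
            }
        }
        return grid;
    }

    public static void main(String[] args) {
        // One candidate value per run of the clustering job.
        for (double alpha : alphaGrid(-2, 1, 20.0)) {
            System.out.println("--alpha " + alpha);
        }
    }
}
```

Each printed value would replace the `"1.0"` after `"--alpha"` in the argument set quoted above, one run per value.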

## Re: CardinalityException in DirichletDriver

I see a stack when the size of the vector mean is set to 2:

    Daemon Thread [Thread-9] (Suspended (breakpoint at line 48 in NormalModel))
        NormalModel.<init>(Vector, double) line: 48
        NormalModelDistribution.sampleFromPrior(int) line: 33
        DirichletState.<init>(ModelDistribution, int, double, int, int) line: 48
        DirichletDriver.createState(String, int, double) line: 172
        DirichletDriver.writeInitialState(String, String, String, int, double) line: 150
        DirichletDriver.runJob(String, String, String, int, int, double, int) line: 133
        DirichletDriver.main(String[]) line: 109
        Clusters.doClustering() line: 244
        Clusters.access$0(Clusters) line: 175
        Clusters$1.run() line: 148
        Thread.run() line: 619

The model is created here:

    public class NormalModelDistribution implements ModelDistribution {
      @Override
      public Model[] sampleFromPrior(int howMany) {
        Model[] result = new NormalModel[howMany];
        for (int i = 0; i < howMany; i++) {
          // The mean vector is always created with cardinality 2.
          result[i] = new NormalModel(new DenseVector(2), 1);
        }
        return result;
      }
    }

and later this vector is dotted with the x vector:

      @Override
      public double pdf(Vector x) {
        double sd2 = stdDev * stdDev;
        double exp = -(x.dot(x) - 2 * x.dot(mean) + mean.dot(mean)) / (2 * sd2);
        double ex = Math.exp(exp);
        return ex / (stdDev * sqrt2pi);
      }

The x vector is coming from the Hadoop MapRunner through the map function:

      public void map(WritableComparable key, Vector v,
                      OutputCollector output, Reporter reporter) throws IOException {

Any idea?

btw, I am running Mahout 0.2...should I move to 0.3 or to trunk? Is it safe enough to run against trunk?

On Wed, Jan 13, 2010 at 10:13 PM, Ted Dunning <[hidden email]> wrote:

> I am not familiar with the code, but if I had time to look, my strategy would be to start in the NormalModel and work back up the stack trace to find out how the vectors came to be different lengths.

-- Best regards, Bogdan
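
To make the failure concrete, here is a minimal sketch of a dot product with a cardinality check, using plain `double[]` arrays as stand-ins for Mahout's `Vector` classes. The class `DotSketch`, the use of `IllegalArgumentException` in place of Mahout's `CardinalityException`, and the sample data are all assumptions made for illustration.

```java
// Stand-in sketch: plain arrays instead of Mahout vectors.
class DotSketch {

    // Sum of products of corresponding elements; both inputs must have the same length.
    static double dot(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException(
                "cardinality mismatch: " + a.length + " vs " + b.length);
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] mean = new double[2];         // analogous to new DenseVector(2)
        double[] x = {0.5, 1.0, 0.0, 2.0};     // made-up data vector with 4 elements
        System.out.println(dot(x, x));         // same length, so this succeeds
        try {
            dot(x, mean);                      // lengths 4 and 2, so this fails
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In the quoted `pdf` method, `x.dot(x)` and `mean.dot(mean)` each involve a single vector, so `x.dot(mean)` is the only call that can see two different cardinalities.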

## Re: CardinalityException in DirichletDriver

OK, just reproduced with code from trunk :|

On Wed, Jan 13, 2010 at 11:07 PM, Bogdan Vatkov <[hidden email]> wrote:

> btw, I am running Mahout 0.2...should I move to 0.3 or to trunk? Is it safe enough to run against trunk?

-- Best regards, Bogdan

## Re: CardinalityException in DirichletDriver

But if I am the first one to use Dirichlet, which algorithm is the recommended one? Are all other algs better than Dirichlet, so no one used it ;)?

On Thu, Jan 14, 2010 at 1:23 AM, Jeff Eastman <[hidden email]> wrote:

> The NormalModelDistribution seems to still think all the data vectors are size=2. In sampleFromPrior, it is creating models with that size. Subsequently, when you calculate the pdf with your data value (x) the sizes are incompatible. Suggest changing 'DenseVector(2)' to 'DenseVector(n)', where n is your data cardinality. Please also look at the rest of the math in DenseVector with suspicion. AFAIK, you are the first person to try to use Dirichlet.

-- Best regards, Bogdan
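
Jeff's suggestion of replacing `DenseVector(2)` with `DenseVector(n)` can be sketched with plain arrays. Everything below is a hypothetical stand-in: `NormalPriorSketch`, its nested `Model`, and the extra `cardinality` parameter are not Mahout's API, and how n would actually reach `sampleFromPrior` inside `DirichletDriver` is not covered in this thread.

```java
// Stand-in sketch of a prior whose models match the data cardinality.
class NormalPriorSketch {

    // Simplified stand-in for NormalModel: a mean vector and one shared standard deviation.
    static final class Model {
        final double[] mean;
        final double stdDev;

        Model(double[] mean, double stdDev) {
            this.mean = mean;
            this.stdDev = stdDev;
        }

        // exp(-|x - mean|^2 / (2 * stdDev^2)), left unnormalized on purpose.
        double unnormalizedPdf(double[] x) {
            if (x.length != mean.length) {
                throw new IllegalArgumentException(
                    "cardinality mismatch: " + x.length + " vs " + mean.length);
            }
            double squaredDistance = 0.0;
            for (int i = 0; i < x.length; i++) {
                double d = x[i] - mean[i];
                squaredDistance += d * d;
            }
            return Math.exp(-squaredDistance / (2 * stdDev * stdDev));
        }
    }

    // Assumption: the caller knows the data cardinality and passes it in.
    static Model[] sampleFromPrior(int howMany, int cardinality) {
        Model[] result = new Model[howMany];
        for (int i = 0; i < howMany; i++) {
            result[i] = new Model(new double[cardinality], 1.0); // zero mean, stdDev 1
        }
        return result;
    }

    public static void main(String[] args) {
        double[] x = {0.1, 0.0, 0.3, 0.0, 0.2}; // made-up 5-element data vector
        Model[] models = sampleFromPrior(3, x.length);
        System.out.println(models[0].unnormalizedPdf(x));
    }
}
```

With the models sized from the data, the same check that fails for a 2-element mean passes for any n.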

## Re: CardinalityException in DirichletDriver

On Wed, Jan 13, 2010 at 3:23 PM, Jeff Eastman <[hidden email]> wrote:

> The NormalModelDistribution seems to still think all the data vectors are size=2. In sampleFromPrior, it is creating models with that size. Subsequently, when you calculate the pdf with your data value (x) the sizes are incompatible. Suggest changing 'DenseVector(2)' to 'DenseVector(n)', where n is your data cardinality. Please also look at the rest of the math in DenseVector with suspicion. AFAIK, you are the first person to try to use Dirichlet.

Ack, this is bad - why have we not caught this in unit tests?

-jake
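
On "the rest of the math": as a worked identity rather than a diagnosis of the Mahout code, the exponent in the `pdf` method quoted earlier in the thread is an expanded squared distance. Writing $\mu$ for `mean` and $\sigma$ for `stdDev`:

$$
x \cdot x - 2\, x \cdot \mu + \mu \cdot \mu = \lVert x - \mu \rVert^2,
\qquad
\text{exp} = -\frac{\lVert x - \mu \rVert^2}{2\sigma^2}
$$

The cross term $x \cdot \mu$ is only defined when $x$ and $\mu$ have the same cardinality $n$. For an isotropic normal density in $n$ dimensions the normalizing constant is $(\sigma\sqrt{2\pi})^n$, whereas the quoted code divides by `stdDev * sqrt2pi` (assuming `sqrt2pi` holds $\sqrt{2\pi}$), which matches the case $n = 1$. Whether that affects the clustering results is not settled in this thread.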

## Re: CardinalityException in DirichletDriver

Because the unit tests were 2-dimensional examples.

On Wed, Jan 13, 2010 at 3:34 PM, Jake Mannix <[hidden email]> wrote:

> Ack, this is bad - why have we not caught this in unit tests?

-- Ted Dunning, CTO DeepDyve

## Re: CardinalityException in DirichletDriver

Dirichlet has some theoretical advantages (such as deducing how many clusters are justified and providing non-deterministic answers when ambiguity is present). It has no run-time. It probably is more delicate with respect to parameter settings.

If you have some time budget, I think you could get some substantial improvements. If you are in a hurry, then k-means will work much better.

On Wed, Jan 13, 2010 at 3:26 PM, Bogdan Vatkov <[hidden email]> wrote:

> But I am the first one to use Dirichlet which algorithm is the recommended one? Are all other algs better than Dirichlet so no one used it ;)?

-- Ted Dunning, CTO DeepDyve

## Re: CardinalityException in DirichletDriver

Because of its non-deterministic nature, Dirichlet is darn hard to test. The 2-d tests offer the option of plotting out the points and the models and eyeballing the result (http://cwiki.apache.org/MAHOUT/dirichlet-process-clustering.html), but more rigorous testing and higher order problems in general are needed. There was a student on this list last summer who offered some pointed suggestions, but he did not follow up and I've been under water in a startup.

Ted Dunning wrote:

> Because the unit tests were 2-dimensional examples.
>
> On Wed, Jan 13, 2010 at 3:34 PM, Jake Mannix <[hidden email]> wrote:
>
>> Ack, this is bad - why have we not caught this in unit tests?

## Re: CardinalityException in DirichletDriver

 OT: Every time I see this go by, I expect to see 'Cardinality' and 'Richelieu'.