CardinalityException in DirichletDriver


Bogdan94202
What could be the reason for this CardinalityException?

10/01/13 01:41:09 INFO clustering.SolrToMahoutDriver: Wrote: 174 vectors
10/01/13 01:41:09 INFO clustering.SolrToMahoutDriver: Dictionary Output file: /store/dev/inst/mahout-0.2/email-clustering/1-solr-vectors/dictionary.txt
10/01/13 01:41:11 INFO dirichlet.DirichletDriver: Iteration 0
10/01/13 01:41:11 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
10/01/13 01:41:11 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
10/01/13 01:41:11 INFO mapred.FileInputFormat: Total input paths to process : 1
10/01/13 01:41:11 INFO mapred.JobClient: Running job: job_local_0001
10/01/13 01:41:11 INFO mapred.FileInputFormat: Total input paths to process : 1
10/01/13 01:41:11 INFO compress.CodecPool: Got brand-new decompressor
10/01/13 01:41:11 INFO mapred.MapTask: numReduceTasks: 1
10/01/13 01:41:11 INFO mapred.MapTask: io.sort.mb = 100
10/01/13 01:41:12 INFO mapred.MapTask: data buffer = 79691776/99614720
10/01/13 01:41:12 INFO mapred.MapTask: record buffer = 262144/327680
10/01/13 01:41:12 WARN mapred.LocalJobRunner: job_local_0001
org.apache.mahout.matrix.CardinalityException
    at org.apache.mahout.matrix.AbstractVector.dot(AbstractVector.java:92)
    at org.apache.mahout.clustering.dirichlet.models.NormalModel.pdf(NormalModel.java:111)
    at org.apache.mahout.clustering.dirichlet.models.NormalModel.pdf(NormalModel.java:28)
    at org.apache.mahout.clustering.dirichlet.DirichletState.adjustedProbability(DirichletState.java:129)
    at org.apache.mahout.clustering.dirichlet.DirichletMapper.normalizedProbabilities(DirichletMapper.java:111)
    at org.apache.mahout.clustering.dirichlet.DirichletMapper.map(DirichletMapper.java:47)
    at org.apache.mahout.clustering.dirichlet.DirichletMapper.map(DirichletMapper.java:38)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:176)
10/01/13 01:41:12 INFO mapred.JobClient:  map 0% reduce 0%
10/01/13 01:41:12 INFO mapred.JobClient: Job complete: job_local_0001
10/01/13 01:41:12 INFO mapred.JobClient: Counters: 0
10/01/13 01:41:12 WARN dirichlet.DirichletDriver: java.io.IOException: Job failed!
java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
    at org.apache.mahout.clustering.dirichlet.DirichletDriver.runIteration(DirichletDriver.java:214)
    at org.apache.mahout.clustering.dirichlet.DirichletDriver.runJob(DirichletDriver.java:139)
    at org.apache.mahout.clustering.dirichlet.DirichletDriver.main(DirichletDriver.java:109)
    at org.bogdan.clustering.mbeans.Clusters.doClustering(Clusters.java:244)
    at org.bogdan.clustering.mbeans.Clusters.access$0(Clusters.java:175)
    at org.bogdan.clustering.mbeans.Clusters$1.run(Clusters.java:148)
    at java.lang.Thread.run(Thread.java:619)

Re: CardinalityException in DirichletDriver

Grant Ingersoll-2
I don't have the code in front of me, but judging by the location of the stack trace, my guess is that the sizes of the two vectors being "dotted" aren't the same.


Re: CardinalityException in DirichletDriver

Bogdan94202
Sorry, what does that mean :)?
What is a dotted vector, and why aren't they the same?
What should I investigate?
I am basically running my complete k-means scenario (same input data, same
number-of-clusters param, etc.), just replacing the KMeansDriver.main step
with a DirichletDriver.main call. Of course the arguments are adjusted,
since k-means and Dirichlet do not take the same arguments.
I am not sure what values I should give for the alpha, iterations and
reductions arguments. Here is my current argument set:

args = new String[] {
    "--input",
    "/store/dev/inst/mahout-0.2/email-clustering/1-solr-vectors/solr_index.vec",
    "--output", config.getClustersDir(),
    "--modelClass",
    "org.apache.mahout.clustering.dirichlet.models.NormalModelDistribution",
    "--maxIter", "15",
    "--alpha", "1.0",
    "--k", config.getClustersCount(),
    "--maxRed", "2"
};

anything suspicious in there?

--
Best regards,
Bogdan

Re: CardinalityException in DirichletDriver

Ted Dunning
On Wed, Jan 13, 2010 at 11:53 AM, Bogdan Vatkov <[hidden email]>wrote:

> Sorry, what does that mean :)?
>

It means that there is probably a programming bug somehow.  At the very
least, the program is not robust with respect to strange invocations.


> what is a dotted vector? and why aren't they the same?
>

The dot product is a vector operation: it is the sum of the products of
corresponding elements of the two vectors being operated on.  If the two
vectors don't have the same length, the operation is an error.
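
To make that concrete, here is a minimal sketch of a dot product with a length check. It uses plain double[] arrays as a stand-in and is not Mahout's AbstractVector code; the class name DotProductSketch and the use of IllegalArgumentException (where Mahout reports CardinalityException) are assumptions made only for illustration.

```java
// Illustrative stand-in for a vector dot product; this is NOT Mahout's implementation.
final class DotProductSketch {

    /** Sum of products of corresponding elements; rejects vectors of different length. */
    static double dot(double[] a, double[] b) {
        if (a.length != b.length) {
            // Assumption: Mahout reports this condition as CardinalityException.
            throw new IllegalArgumentException(
                "cardinality mismatch: " + a.length + " vs " + b.length);
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        // 1*4 + 2*5 + 3*6 = 32
        System.out.println(dot(new double[] {1, 2, 3}, new double[] {4, 5, 6}));
        try {
            dot(new double[] {1, 2}, new double[] {1, 2, 3});
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // cardinality mismatch: 2 vs 3
        }
    }
}
```

The second call fails in the same way the mapper does in the trace: one operand has 2 elements and the other does not.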

> what should I investigate?
>

I am not familiar with the code, but if I had time to look, my strategy
would be to start in the NormalModel and work back up the stack trace to
find out how the vectors came to be different lengths.  No doubt, the code
in NormalModel will not tell you anything, but you can see which vectors are
involved and by walking up the stack you may be able to see where they come
from.


> I am basically running my complete kmeans scenario (same input data, same
> number of clusters param, etc.) but just replacing KmeansDriver.main step
> with a DirichletDriver.main call...of course the arguments are adjusted
> since kmeans and dirichlet do not have the same arguments.
>

I would think that this sounds very plausible.


> I am not sure what number I should give for the alpha argument,


Alpha should have a value in the range from 0.01 to 20.  I would scan in
1-2-5 steps per order of magnitude to see what works well for your data
(i.e., 0.01, 0.02, 0.05, 0.1, 0.2 ... 20).  A value of 1 is a fine place to
start.  The effect of different values should be small over a pretty wide range.
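
A small sketch of how such a scan list could be generated, assuming you then run one clustering job per value yourself. AlphaGridSketch and alphaGrid are hypothetical names, not part of Mahout.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds the 1-2-5 grid of alpha values from 0.01 up to a cap.
final class AlphaGridSketch {

    static List<Double> alphaGrid(double max) {
        int[] mantissas = {1, 2, 5};
        List<Double> values = new ArrayList<>();
        // Decades 10^-2 .. 10^1 cover 0.01 through 50; the cap trims the tail.
        for (int exp = -2; exp <= 1; exp++) {
            for (int m : mantissas) {
                // BigDecimal avoids accumulating floating-point error in the grid values.
                double v = new BigDecimal(m).scaleByPowerOfTen(exp).doubleValue();
                if (v <= max) {
                    values.add(v);
                }
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Each value would be passed as the "--alpha" argument of a separate run.
        System.out.println(alphaGrid(20.0));
    }
}
```

With a cap of 20 this gives eleven candidate values, starting at 0.01 and ending at 20.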


> iterations
> and reductions...here is my current argument set:
>
> args = new String[] {
> "--input",
>
> "/store/dev/inst/mahout-0.2/email-clustering/1-solr-vectors/solr_index.vec",
> "--output", config.getClustersDir(),
> "--modelClass",
> "org.apache.mahout.clustering.dirichlet.models.NormalModelDistribution",
> "--maxIter", "15",
> "--alpha", "1.0",
> "--k", config.getClustersCount(),
> "--maxRed", "2"
> };
>
>
Nothing in there looks suspicious off-hand.

Re: CardinalityException in DirichletDriver

Bogdan94202
Here is the stack at the point where the size of the mean vector is set to 2:

Daemon Thread [Thread-9] (Suspended (breakpoint at line 48 in NormalModel))
    NormalModel.<init>(Vector, double) line: 48
    NormalModelDistribution.sampleFromPrior(int) line: 33
    DirichletState<O>.<init>(ModelDistribution<O>, int, double, int, int) line: 48
    DirichletDriver.createState(String, int, double) line: 172
    DirichletDriver.writeInitialState(String, String, String, int, double) line: 150
    DirichletDriver.runJob(String, String, String, int, int, double, int) line: 133
    DirichletDriver.main(String[]) line: 109
    Clusters.doClustering() line: 244
    Clusters.access$0(Clusters) line: 175
    Clusters$1.run() line: 148
    Thread.run() line: 619


public class NormalModelDistribution implements ModelDistribution<Vector> {
  @Override
  public Model<Vector>[] sampleFromPrior(int howMany) {
    Model<Vector>[] result = new NormalModel[howMany];
    for (int i = 0; i < howMany; i++) {
      result[i] = new NormalModel(new DenseVector(2), 1);
    }
    return result;
  }

and later this mean vector is dotted with x in pdf():
  @Override
  public double pdf(Vector x) {
    double sd2 = stdDev * stdDev;
    double exp = -(x.dot(x) - 2 * x.dot(mean) + mean.dot(mean)) / (2 * sd2);
    double ex = Math.exp(exp);
    return ex / (stdDev * sqrt2pi);
  }

The x vector comes from the Hadoop MapRunner through the map function:

  public void map(WritableComparable<?> key, Vector v,
                  OutputCollector<Text, Vector> output, Reporter reporter)
      throws IOException {


any idea?

btw, I am running Mahout 0.2...should I move to 0.3 or to trunk? is it safe
enough to run against trunk?


--
Best regards,
Bogdan

Re: CardinalityException in DirichletDriver

Bogdan94202
ok, just reproduced w/ code from trunk :|

--
Best regards,
Bogdan

Re: CardinalityException in DirichletDriver

Jeff Eastman
In reply to this post by Bogdan94202
The NormalModelDistribution seems to still assume that all the data
vectors have size 2.  In sampleFromPrior it creates models with that
size, so when you later calculate the pdf with your data value (x) the
sizes are incompatible. I suggest changing 'DenseVector(2)' to
'DenseVector(n)', where n is your data cardinality. Please also look at
the rest of the math in DenseVector with suspicion. AFAIK, you are the
first person to try to use Dirichlet.
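
A rough sketch of that change, written against plain double[] arrays rather than Mahout's classes so that it runs on its own. PriorSketch, its nested Model class, and the extra cardinality parameter on sampleFromPrior are all assumptions for illustration; the real ModelDistribution interface in the thread takes only howMany. The pdf mirrors the formula quoted earlier in the thread (note that x.x - 2 x.mean + mean.mean is the squared distance between x and mean), and its normalizing constant is copied as-is, not re-derived for n dimensions.

```java
// Stand-alone illustration only; names and signatures here are hypothetical, not Mahout's.
final class PriorSketch {

    /** Minimal stand-in for a normal model: a mean vector plus a single std deviation. */
    static final class Model {
        final double[] mean;
        final double stdDev;

        Model(double[] mean, double stdDev) {
            this.mean = mean;
            this.stdDev = stdDev;
        }

        /** Same expression as the quoted pdf(), written as a squared distance. */
        double pdf(double[] x) {
            if (x.length != mean.length) {
                throw new IllegalArgumentException(
                    "cardinality mismatch: " + x.length + " vs " + mean.length);
            }
            double squaredDistance = 0.0;
            for (int i = 0; i < x.length; i++) {
                double d = x[i] - mean[i];
                squaredDistance += d * d;
            }
            // Constant taken from the thread's snippet; treat it with the same suspicion.
            return Math.exp(-squaredDistance / (2 * stdDev * stdDev))
                / (stdDev * Math.sqrt(2 * Math.PI));
        }
    }

    /** Creates prior models whose mean has the data's cardinality instead of a fixed 2. */
    static Model[] sampleFromPrior(int howMany, int cardinality) {
        Model[] result = new Model[howMany];
        for (int i = 0; i < howMany; i++) {
            result[i] = new Model(new double[cardinality], 1.0);
        }
        return result;
    }

    public static void main(String[] args) {
        Model[] models = sampleFromPrior(3, 5);
        System.out.println(models[0].pdf(new double[5])); // data and mean both have 5 elements
    }
}
```

How the cardinality reaches the distribution (constructor argument, job configuration, or peeking at the first input vector) is a separate design question that this sketch leaves open.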



Re: CardinalityException in DirichletDriver

Bogdan94202
But if I am the first one to use Dirichlet, which algorithm is the recommended
one? Are all the other algs better than Dirichlet, so no one has used it ;)?

--
Best regards,
Bogdan

Re: CardinalityException in DirichletDriver

Jake Mannix
In reply to this post by Jeff Eastman
On Wed, Jan 13, 2010 at 3:23 PM, Jeff Eastman <[hidden email]> wrote:

> The NormalModelDistribution seems to still think all the data vectors are
> size=2.  In sampleFromPrior, it is creating models with that size.
> Subsequently, when you calculate the pdf with your data value (x) the sizes
> are incompatible. Suggest changing 'DenseVector(2)' to 'DenseVector(n)',
> where n is your data cardinality. Please also look at the rest of the math
> in DenseVector with suspicion. AFAIK, you are the first person to try to
> use Dirichlet.


Ack, this is bad - why have we not caught this in unit tests?

  -jake
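
The failure described in the quoted diagnosis can be reproduced without Mahout. The sketch below uses plain double arrays instead of Mahout's Vector classes, and the names (DotProductDemo, dot) are hypothetical; it only illustrates why a mean vector of fixed size 2 cannot be dotted with a data vector of another size, and why sizing the mean to the data cardinality avoids the error.

```java
// Hypothetical stand-in for the vector math in the thread; not Mahout code.
class DotProductDemo {

    /** Sum of products of corresponding elements; both arrays must have the same length. */
    static double dot(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException(
                "cardinality mismatch: " + a.length + " vs " + b.length);
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] x = {1.0, 0.0, 2.0, 0.0, 3.0};      // a 5-dimensional data point
        double[] fixedMean = new double[2];          // like 'DenseVector(2)'
        double[] sizedMean = new double[x.length];   // like 'DenseVector(n)'

        try {
            dot(x, fixedMean);
        } catch (IllegalArgumentException e) {
            System.out.println("fixed-size mean failed: " + e.getMessage());
        }
        System.out.println("sized mean works, dot = " + dot(x, sizedMean));
    }
}
```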

Re: CardinalityException in DirichletDriver

Ted Dunning
Because the unit tests were 2-dimensional examples.

On Wed, Jan 13, 2010 at 3:34 PM, Jake Mannix <[hidden email]> wrote:

> Ack, this is bad - why have we not caught this in unit tests?




--
Ted Dunning, CTO
DeepDyve

Re: CardinalityException in DirichletDriver

Ted Dunning
In reply to this post by Bogdan94202
Dirichlet has some theoretical advantages (such as deducing how many
clusters are justified and providing non-deterministic answers when
ambiguity is present).  It has no run-time advantages.  It probably is more
delicate with respect to parameter settings.

If you have some time budget, I think you could get some substantial
improvements.

If you are in a hurry, then k-means will work much better.

On Wed, Jan 13, 2010 at 3:26 PM, Bogdan Vatkov <[hidden email]> wrote:

> But if I am the first one to use Dirichlet, which algorithm is the recommended
> one? Are all other algs better than Dirichlet so no one used it ;)?
>



--
Ted Dunning, CTO
DeepDyve

Re: CardinalityException in DirichletDriver

Jeff Eastman
In reply to this post by Bogdan94202
I think KMeans and Canopy are the most-used and therefore the most
robust. Dirichlet still has not seen much use beyond some test examples
and NormalModel has at least one known problem (with sample() only
returning the maximum likelihood) that has been reported but never
fixed. Can you point me to the problem you are running so I can try to
get up to speed? It has been some time since I worked in this code but
I'm keen to do so and I have some time to invest.

Jeff


Bogdan Vatkov wrote:

> But if I am the first one to use Dirichlet, which algorithm is the recommended
> one? Are all other algs better than Dirichlet so no one used it ;)?
>
> On Thu, Jan 14, 2010 at 1:23 AM, Jeff Eastman <[hidden email]> wrote:
>
>> The NormalModelDistribution seems to still think all the data vectors are
>> size=2.  In sampleFromPrior, it is creating models with that size.
>> Subsequently, when you calculate the pdf with your data value (x) the sizes
>> are incompatible. Suggest changing 'DenseVector(2)' to 'DenseVector(n)',
>> where n is your data cardinality. Please also look at the rest of the math
>> in DenseVector with suspicion. AFAIK, you are the first person to try to
>> use Dirichlet.
>>
>> Bogdan Vatkov wrote:
>>
>>> I see this stack when the size of the mean vector is set to 2:
>>>
>>> Daemon Thread [Thread-9] (Suspended (breakpoint at line 48 in NormalModel))
>>> NormalModel.<init>(Vector, double) line: 48
>>> NormalModelDistribution.sampleFromPrior(int) line: 33
>>> DirichletState<O>.<init>(ModelDistribution<O>, int, double, int, int) line: 48
>>> DirichletDriver.createState(String, int, double) line: 172
>>> DirichletDriver.writeInitialState(String, String, String, int, double) line: 150
>>> DirichletDriver.runJob(String, String, String, int, int, double, int) line: 133
>>> DirichletDriver.main(String[]) line: 109
>>> Clusters.doClustering() line: 244
>>> Clusters.access$0(Clusters) line: 175
>>> Clusters$1.run() line: 148
>>> Thread.run() line: 619
>>>
>>> public class NormalModelDistribution implements ModelDistribution<Vector> {
>>>   @Override
>>>   public Model<Vector>[] sampleFromPrior(int howMany) {
>>>     Model<Vector>[] result = new NormalModel[howMany];
>>>     for (int i = 0; i < howMany; i++) {
>>>       result[i] = new NormalModel(new DenseVector(2), 1);
>>>     }
>>>     return result;
>>>   }
>>>
>>> and later this vector is dotted with
>>>
>>>   @Override
>>>   public double pdf(Vector x) {
>>>     double sd2 = stdDev * stdDev;
>>>     double exp = -(x.dot(x) - 2 * x.dot(mean) + mean.dot(mean)) / (2 * sd2);
>>>     double ex = Math.exp(exp);
>>>     return ex / (stdDev * sqrt2pi);
>>>   }
>>>
>>> the x vector, which is coming from Hadoop MapRunner through the map function:
>>>
>>>   public void map(WritableComparable<?> key, Vector v,
>>>                   OutputCollector<Text, Vector> output, Reporter reporter)
>>>       throws IOException {
>>>
>>> Any idea?
>>>
>>> btw, I am running Mahout 0.2...should I move to 0.3 or to trunk? Is it safe
>>> enough to run against trunk?
>>>
>>> On Wed, Jan 13, 2010 at 10:13 PM, Ted Dunning <[hidden email]> wrote:
>>>
>>>> On Wed, Jan 13, 2010 at 11:53 AM, Bogdan Vatkov <[hidden email]> wrote:
>>>>
>>>>> Sorry, what does that mean :)?
>>>>>
>>>> It means that there is probably a programming bug somehow.  At the very
>>>> least, the program is not robust with respect to strange invocations.
>>>>
>>>>> what is a dotted vector? and why aren't they the same?
>>>>>
>>>> dot product is a vector operation that is the sum of products of
>>>> corresponding elements of the two vectors being operated on.  If these
>>>> vectors don't have the same length, then it is an error.
>>>>
>>>>> what should I investigate?
>>>>>
>>>> I am not familiar with the code, but if I had time to look, my strategy
>>>> would be to start in the NormalModel and work back up the stack trace to
>>>> find out how the vectors came to be different lengths.  No doubt, the code
>>>> in NormalModel will not tell you anything, but you can see which vectors are
>>>> involved and by walking up the stack you may be able to see where they come
>>>> from.
>>>>
>>>>> I am basically running my complete kmeans scenario (same input data, same
>>>>> number of clusters param, etc.) but just replacing KmeansDriver.main step
>>>>> with a DirichletDriver.main call...of course the arguments are adjusted
>>>>> since kmeans and dirichlet do not have the same arguments.
>>>>>
>>>> I would think that this sounds very plausible.
>>>>
>>>>> I am not sure what number I should give for the alpha argument,
>>>>>
>>>> Alpha should have a value in the range from 0.01 to 20.  I would scan with
>>>> 1, 2, 5 magnitude steps to see what works well for your data (i.e. 0.01,
>>>> 0.02, 0.05, 0.1, 0.2 ... 20).  A value of 1 is a fine place to start.  The
>>>> effect of different values should be small over a pretty wide range.
>>>>
>>>>> iterations
>>>>> and reductions...here is my current argument set:
>>>>>
>>>>> args = new String[] {
>>>>> "--input",
>>>>> "/store/dev/inst/mahout-0.2/email-clustering/1-solr-vectors/solr_index.vec",
>>>>> "--output", config.getClustersDir(),
>>>>> "--modelClass",
>>>>> "org.apache.mahout.clustering.dirichlet.models.NormalModelDistribution",
>>>>> "--maxIter", "15",
>>>>> "--alpha", "1.0",
>>>>> "--k", config.getClustersCount(),
>>>>> "--maxRed", "2"
>>>>> };
>>>>>
>>>> Not off-hand.
>  
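
The pdf quoted above expands the squared distance between x and the mean as x.x - 2 x.mean + mean.mean, so every dot product needs vectors of equal cardinality. The sketch below shows the same expansion on plain double arrays for a mean sized to the data. It is not the Mahout implementation: the class name IsotropicNormal is hypothetical, and the normalization by (stdDev * sqrt(2 pi))^n is the standard one for an n-dimensional isotropic normal, which is an assumption here and differs from the quoted code, where the divisor is stdDev * sqrt2pi regardless of dimension.

```java
// Hypothetical sketch of an n-dimensional isotropic normal density; not Mahout code.
class IsotropicNormal {

    static double dot(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException(
                "cardinality mismatch: " + a.length + " vs " + b.length);
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Density at x of a normal with the given mean and the same stdDev in every dimension. */
    static double pdf(double[] x, double[] mean, double stdDev) {
        double sd2 = stdDev * stdDev;
        // |x - mean|^2 written as three dot products, as in the quoted pdf.
        double squaredDistance = dot(x, x) - 2 * dot(x, mean) + dot(mean, mean);
        // Assumed normalization for n dimensions: (stdDev * sqrt(2 pi))^n.
        double normalizer = Math.pow(stdDev * Math.sqrt(2 * Math.PI), x.length);
        return Math.exp(-squaredDistance / (2 * sd2)) / normalizer;
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0, 0.5};
        double[] mean = new double[x.length];   // sized to the data, not hard-coded
        System.out.println("pdf = " + pdf(x, mean, 1.0));
    }
}
```

A check at several dimensions (for example n = 1, 2, 3) is one cheap way to catch a hard-coded size, since the 2-dimensional case alone would pass.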


Re: CardinalityException in DirichletDriver

Liang Chenmin
I had similar bugs before, and found out that they were due to some changes in
my code which generated two vectors with different lengths. You could print out
some logs and look at the generated data to check.

On Wed, Jan 13, 2010 at 3:49 PM, Jeff Eastman <[hidden email]>wrote:

> I think KMeans and Canopy are the most-used and therefore the most robust.
> Dirichlet still has not seen much use beyond some test examples and
> NormalModel has at least one known problem (with sample() only returning the
> maximum likelihood) that has been reported but never fixed. Can you point me
> to the problem you are running so I can try to get up to speed? It has been
> some time since I worked in this code but I'm keen to do so and I have some
> time to invest.
>
> Jeff


--
Chenmin Liang
Language Technologies Institute, School of Computer Science
Carnegie Mellon University
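
One way to do the check suggested above is to count how many vectors have each length before clustering. The sketch below is a minimal, Mahout-free illustration on double arrays; the class and method names (CardinalityCheck, lengthHistogram) are hypothetical, and with real Mahout vectors the same idea would apply to whatever size accessor the Vector class provides.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical diagnostic, not part of Mahout: counts vectors by length.
class CardinalityCheck {

    /** Maps each vector length to the number of vectors that have it. */
    static Map<Integer, Integer> lengthHistogram(List<double[]> vectors) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (double[] v : vectors) {
            counts.merge(v.length, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<double[]> vectors =
            Arrays.asList(new double[3], new double[3], new double[2]);
        Map<Integer, Integer> histogram = lengthHistogram(vectors);
        if (histogram.size() == 1) {
            System.out.println("all vectors agree: " + histogram);
        } else {
            System.out.println("mixed lengths found: " + histogram);
        }
    }
}
```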

Re: CardinalityException in DirichletDriver

Jeff Eastman
In reply to this post by Ted Dunning
Because of its non-deterministic nature, Dirichlet is darn hard to test.
The 2-d tests offer the option of plotting out the points and the models
and eyeballing the result
(http://cwiki.apache.org/MAHOUT/dirichlet-process-clustering.html) but
more rigorous testing and higher order problems in general are needed.
There was a student on this list last summer who offered some pointed
suggestions but he did not follow up and I've been under water in a startup.


Ted Dunning wrote:

> Because the unit tests were 2-dimensional examples.
>
> On Wed, Jan 13, 2010 at 3:34 PM, Jake Mannix <[hidden email]> wrote:
>
>  
>> Ack, this is bad - why have we not caught this in unit tests?
>>    
>
>
>
>
>  


Re: CardinalityException in DirichletDriver

Bogdan94202
In reply to this post by Jeff Eastman
Hi Jeff,

What kind of details do you need to continue?
In the meantime I am going back to kmeans anyway (maybe I should really start by
adding canopy to my kmeans-only scenario first ;)).

Best regards,
Bogdan

On Thu, Jan 14, 2010 at 1:49 AM, Jeff Eastman <[hidden email]>wrote:

> I think KMeans and Canopy are the most-used and therefore the most robust.
> Dirichlet still has not seen much use beyond some test examples and
> NormalModel has at least one known problem (with sample() only returning the
> maximum likelihood) that has been reported but never fixed. Can you point me
> to the problem you are running so I can try to get up to speed? It has been
> some time since I worked in this code but I'm keen to do so and I have some
> time to invest.
>
> Jeff


--
Best regards,
Bogdan

Re: CardinalityException in DirichletDriver

Jeff Eastman
I gather you are doing text clustering? Are you using one of our example
datasets or one which is publicly available?


Bogdan Vatkov wrote:

> Hi Jeff,
>
> What kind of details do you need to continue?
> In the mean time I am anyway going back to kmeans (maybe I really start with
> adding canopy to my kmeans only scenario first ;)).
>
> Best regards,
> Bogdan


Re: CardinalityException in DirichletDriver

Bogdan94202
Unfortunately I am using private data which I cannot share. I am using
emails, indexed by Solr, and then creating vectors out of them. I am using
them with k-means and everything is ok. I just wanted to try out the Dirichlet
algorithm.

On Thu, Jan 14, 2010 at 8:49 PM, Jeff Eastman <[hidden email]>wrote:

> I gather you are doing text clustering? Are you using one of our example
> datasets or one which is publicly available?
>
>
>
> Bogdan Vatkov wrote:
>
>> Hi Jeff,
>>
>> What kind of details do you need to continue?
>> In the mean time I am anyway going back to kmeans (maybe I really start
>> with
>> adding canopy to my kmeans only scenario first ;)).
>>
>> Best regards,
>> Bogdan
>>
>> On Thu, Jan 14, 2010 at 1:49 AM, Jeff Eastman <[hidden email]
>> >wrote:
>>
>>
>>
>>> I think KMeans and Canopy are the most-used and therefore the most
>>> robust.
>>> Dirichlet still has not seen much use beyond some test examples and
>>> NormalModel has at least one known problem (with sample() only returning
>>> the
>>> maximum likelihood) that has been reported but never fixed. Can you point
>>> me
>>> to the problem you are running so I can try to get up to speed? It has
>>> been
>>> some time since I worked in this code but I'm keen to do so and I have
>>> some
>>> time to invest.
>>>
>>> Jeff
>>>
>>>
>>>
>>> Bogdan Vatkov wrote:
>>>
>>>
>>>
>>>> But I am the first one to use Dirichlet which algorithm is the
>>>> recommended
>>>> one? Are all other algs better then Dirichlet so no one used it ;)?
>>>>
>>>> On Thu, Jan 14, 2010 at 1:23 AM, Jeff Eastman <
>>>> [hidden email]
>>>>
>>>>
>>>>> wrote:
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>> The NormalModelDistribution seems to still think all the data vectors
>>>>> are
>>>>> size=2.  In SampleFromPrior, it is creating models with that size.
>>>>> Subsequently, when you calculate the pdf with your data value (x) the
>>>>> sizes
>>>>> are incompatible. Suggest changing 'DenseVector(2)' to
>>>>> 'DenseVector(n)',
>>>>> where n is your data cardinality. Please also look at the rest of the
>>>>> math
>>>>> in DenseVector with suspiscion. AFAIK, you are the first person to try
>>>>> to
>>>>> use Dirichlet.
>>>>>
>>>>>
>>>>>
>>>>> Bogdan Vatkov wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> I see a stack  when the size of the vectore mean is set to 2:
>>>>>>
>>>>>> Daemon Thread [Thread-9] (Suspended (breakpoint at line 48 in
>>>>>> NormalModel))
>>>>>> NormalModel.<init>(Vector, double) line: 48
>>>>>> NormalModelDistribution.sampleFromPrior(int) line: 33
>>>>>> DirichletState<O>.<init>(ModelDistribution<O>, int, double, int, int)
>>>>>> line:
>>>>>> 48
>>>>>> DirichletDriver.createState(String, int, double) line: 172
>>>>>> DirichletDriver.writeInitialState(String, String, String, int, double)
>>>>>> line:
>>>>>> 150
>>>>>> DirichletDriver.runJob(String, String, String, int, int, double, int)
>>>>>> line:
>>>>>> 133
>>>>>> DirichletDriver.main(String[]) line: 109
>>>>>> Clusters.doClustering() line: 244
>>>>>> Clusters.access$0(Clusters) line: 175
>>>>>> Clusters$1.run() line: 148
>>>>>> Thread.run() line: 619
>>>>>>
>>>>>>
>>>>>> public class NormalModelDistribution implements ModelDistribution<Vector> {
>>>>>>   @Override
>>>>>>   public Model<Vector>[] sampleFromPrior(int howMany) {
>>>>>>     Model<Vector>[] result = new NormalModel[howMany];
>>>>>>     for (int i = 0; i < howMany; i++) {
>>>>>>       result[i] = new NormalModel(new DenseVector(2), 1);
>>>>>>     }
>>>>>>     return result;
>>>>>>   }
>>>>>>
>>>>>> and later this vector (mean) is dotted with x here:
>>>>>>  @Override
>>>>>>  public double pdf(Vector x) {
>>>>>>    double sd2 = stdDev * stdDev;
>>>>>>    double exp = -(x.dot(x) - 2 * x.dot(mean) + mean.dot(mean)) / (2 * sd2);
>>>>>>    double ex = Math.exp(exp);
>>>>>>    return ex / (stdDev * sqrt2pi);
>>>>>>  }
>>>>>>
>>>>>> The x vector comes from the Hadoop MapRunner through the map function:
>>>>>>
>>>>>>  public void map(WritableComparable<?> key, Vector v,
>>>>>>                OutputCollector<Text, Vector> output, Reporter reporter)
>>>>>>      throws IOException {
>>>>>>
>>>>>>
>>>>>> Any idea?
>>>>>>
>>>>>> BTW, I am running Mahout 0.2. Should I move to 0.3 or to trunk? Is it
>>>>>> safe enough to run against trunk?
>>>>>>
>>>>>> On Wed, Jan 13, 2010 at 10:13 PM, Ted Dunning <[hidden email]>
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On Wed, Jan 13, 2010 at 11:53 AM, Bogdan Vatkov <[hidden email]> wrote:
>>>>>>>
>>>>>>>> Sorry, what does that mean :)?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> It means that there is probably a programming bug somehow.  At the
>>>>>>> very
>>>>>>> least, the program is not robust with respect to strange invocations.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> what is a dotted vector? and why aren't they the same?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> A dot product is a vector operation that is the sum of products of
>>>>>>> corresponding elements of the two vectors being operated on. If these
>>>>>>> vectors don't have the same length, then it is an error.
>>>>>>>
>>>>>>>> what should I investigate?
>>>>>>>>
>>>>>>> I am not familiar with the code, but if I had time to look, my strategy
>>>>>>> would be to start in the NormalModel and work back up the stack trace to
>>>>>>> find out how the vectors came to be different lengths. No doubt, the code
>>>>>>> in NormalModel will not tell you anything, but you can see which vectors
>>>>>>> are involved and by walking up the stack you may be able to see where
>>>>>>> they come from.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> I am basically running my complete kmeans scenario (same input data,
>>>>>>>> same number of clusters param, etc.) but just replacing the
>>>>>>>> KMeansDriver.main step with a DirichletDriver.main call. Of course the
>>>>>>>> arguments are adjusted, since kmeans and dirichlet do not have the same
>>>>>>>> arguments.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> I would think that this sounds very plausible.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> I am not sure what number I should give for the alpha argument,
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> Alpha should have a value in the range from 0.01 to 20. I would scan in
>>>>>>> 1, 2, 5 steps per order of magnitude to see what works well for your
>>>>>>> data (i.e. 0.01, 0.02, 0.05, 0.1, 0.2 ... 20). A value of 1 is a fine
>>>>>>> place to start. The effect of different values should be small over a
>>>>>>> pretty wide range.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> iterations
>>>>>>>> and reductions...here is my current argument set:
>>>>>>>>
>>>>>>>> args = new String[] {
>>>>>>>>   "--input",
>>>>>>>>   "/store/dev/inst/mahout-0.2/email-clustering/1-solr-vectors/solr_index.vec",
>>>>>>>>   "--output", config.getClustersDir(),
>>>>>>>>   "--modelClass",
>>>>>>>>   "org.apache.mahout.clustering.dirichlet.models.NormalModelDistribution",
>>>>>>>>   "--maxIter", "15",
>>>>>>>>   "--alpha", "1.0",
>>>>>>>>   "--k", config.getClustersCount(),
>>>>>>>>   "--maxRed", "2"
>>>>>>>> };
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> Not off-hand.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>>
>
>


--
Best regards,
Bogdan

Re: CardinalityException in DirichletDriver

Jeff Eastman
Ah, ok, perhaps I will start with something similar and see how far I
can get with Dirichlet.

Bogdan Vatkov wrote:

> unfortunately I am using private data which I cannot share. I am using
> emails, indexed by Solr and then creating vectors out of them. I am using
> them with k-means and everything is ok. Just wanted to try out the Dirichlet
> algorithm.
>
> On Thu, Jan 14, 2010 at 8:49 PM, Jeff Eastman <[hidden email]> wrote:
>
>  
>> I gather you are doing text clustering? Are you using one of our example
>> datasets or one which is publicly available?


Re: CardinalityException in DirichletDriver

Benson Margulies
OT: Every time I see this go by, I expect to see 'Cardinality' and 'Richelieu'.

Re: CardinalityException in DirichletDriver

Jeff Eastman
In reply to this post by Jeff Eastman
Bogdan,

Recent resolution of MAHOUT-251 should allow you to experiment with
Dirichlet clustering for text models with arbitrary dimensionality. I
suggest starting with the NormalModelDistribution with a large sparse
vector as its prototype.  The other model distributions create sampled
values for all the prior model dimensions, negating any value of using
sparse vectors for their prototypes.
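
As a rough illustration of the prototype idea, the sketch below sizes every prior model from a prototype instead of a hard-coded DenseVector(2), and checks cardinality before computing the pdf. It is only a sketch under stated assumptions: plain double[] arrays stand in for Mahout's Vector classes, and the names PrototypeSizedModels, SimpleNormalModel and the two-argument sampleFromPrior are hypothetical, not the API introduced by MAHOUT-251.

```java
/** Illustrative stand-ins only: plain arrays replace Mahout's Vector classes. */
final class PrototypeSizedModels {

    /** Hypothetical spherical normal model whose mean can have any cardinality. */
    static final class SimpleNormalModel {
        final double[] mean;
        final double stdDev;

        SimpleNormalModel(double[] mean, double stdDev) {
            this.mean = mean;
            this.stdDev = stdDev;
        }

        /** Same formula as the pdf quoted in this thread, with an explicit size check. */
        double pdf(double[] x) {
            if (x.length != mean.length) {
                throw new IllegalArgumentException(
                    "cardinality mismatch: x=" + x.length + ", mean=" + mean.length);
            }
            double squaredDistance = 0.0;
            for (int i = 0; i < x.length; i++) {
                double d = x[i] - mean[i];
                squaredDistance += d * d; // equals x.x - 2 x.mean + mean.mean
            }
            return Math.exp(-squaredDistance / (2 * stdDev * stdDev))
                / (stdDev * Math.sqrt(2 * Math.PI));
        }
    }

    /** Creates howMany prior models sized like the prototype (assumed signature). */
    static SimpleNormalModel[] sampleFromPrior(int howMany, double[] prototype) {
        SimpleNormalModel[] result = new SimpleNormalModel[howMany];
        for (int i = 0; i < howMany; i++) {
            result[i] = new SimpleNormalModel(new double[prototype.length], 1.0);
        }
        return result;
    }
}
```

With models built this way, a data vector of the same cardinality as the prototype goes through pdf() without complaint, while a vector of any other size fails immediately with a message naming both sizes.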

It may in fact be necessary to introduce a new ModelDistribution and
Model so that sparse model elements will not fill up with insignificant
values. After the first iteration computes the new posterior model
parameters from the observations, many of these values will likely be
small so some heuristic would be needed to preserve model sparseness by
removing them altogether. If all these values are retained, it is
probably better to use a dense vector representation. A 50k-dimensional
model will be a real compute hog if it is not kept sparse somehow. Maybe
sampleFromPosterior() or sample() would be good places to embed this
heuristic.
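
One possible form of that heuristic, sketched under assumptions: the sparse model mean is represented here as a java.util.Map from dimension index to value (a stand-in for a Mahout sparse vector), and entries whose magnitude falls below a caller-chosen threshold are dropped. The class name SparsePruning, the prune method and the threshold value are hypothetical; nothing in this thread fixes how or where the pruning would actually be done.

```java
import java.util.Iterator;
import java.util.Map;

/** Illustrative only: a Map of index to value stands in for a sparse vector. */
final class SparsePruning {

    /**
     * Removes entries whose absolute value is below the threshold,
     * so the model mean stays sparse. Returns the number of entries removed.
     */
    static int prune(Map<Integer, Double> sparseMean, double threshold) {
        int removed = 0;
        Iterator<Map.Entry<Integer, Double>> it = sparseMean.entrySet().iterator();
        while (it.hasNext()) {
            if (Math.abs(it.next().getValue()) < threshold) {
                it.remove(); // safe removal while iterating
                removed++;
            }
        }
        return removed;
    }
}
```

Called after the posterior parameters are recomputed, for example as prune(mean, 1e-6), this keeps only the dimensions that carry a meaningful value. Choosing the threshold is the open question: too high and the model loses real signal, too low and the vector fills up anyway.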

I'll begin writing some tests to experiment with these models.

