LDA example missing from 0.9, and I can't get it to work

brian4
The webpage on LDA says to look in the examples directory (examples/bin) for an example of how to run LDA (cvb). However, there is no such example in the 0.9 distribution I downloaded, and I could find no documentation on the Mahout website.

Nevertheless, by reading the LDA pages on the Mahout website that list the command-line arguments, and by searching the internet, I've made several attempts to run it on the Reuters data set, none of which have worked.

I am not using Hadoop, so I tried both "cvb" and "cvb0_local", but neither worked.

My process to get both forms of term vectors:

$MAHOUT seqdirectory -i ${WORK_DIR}/reuters-out -o ${WORK_DIR}/reuters-out-seqdir -c UTF-8 -chunk 64 -xm sequential

$MAHOUT seq2sparse -i ${WORK_DIR}/reuters-out-seqdir -o ${WORK_DIR}/reuters-out-termvecs -wt tf

$MAHOUT rowid \
    -i ${WORK_DIR}/reuters-out-termvecs/tf-vectors \
    -o ${WORK_DIR}/reuters-sparse-vectors-cvb


I then tried the "cvb" command, but I get a NullPointerException from the CVB0Driver call to dictionaryPath.getFileSystem(conf):
$MAHOUT cvb \
    -i ${WORK_DIR}/reuters-sparse-vectors-cvb \
    -o ${WORK_DIR}/reuters-out-topic-term-dist \
    -dt ${WORK_DIR}/reuters-out-doc-topic-dist \
    -mt ${WORK_DIR}/reuters-out-model-states \
    -k 300 -ow -x 200

I assumed this had to do with not using Hadoop, so I switched to cvb0_local. However, whether I use the original term vectors or the rowid-converted term vectors, I get an exception:
$MAHOUT cvb0_local \
    -i ${WORK_DIR}/reuters-sparse-vectors-cvb \
    -to ${WORK_DIR}/reuters-out-topic-term-dist \
    -do ${WORK_DIR}/reuters-out-doc-topic-dist \
    -top 300

$MAHOUT cvb0_local \
    -i ${WORK_DIR}/reuters-out-termvecs/tf-vectors \
    -d ${WORK_DIR}/reuters-out-termvecs/dictionary.file-0 \
    -to ${WORK_DIR}/reuters-out-topic-term-dist \
    -do ${WORK_DIR}/reuters-out-doc-topic-dist \
    -top 300

In both cases I get: "org.apache.hadoop.io.Text cannot be cast to org.apache.mahout.math.VectorWritable"
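
Since the error is a class cast on the value type, one thing that can be checked without Hadoop is which key and value classes each input file actually declares. Below is a minimal sketch, assuming the inputs are standard Hadoop SequenceFiles (whose header stores both class names as plain text near the start of the file) and that both class names begin with "org.apache.". The helper name seqfile_classes is made up for this post, and the loop simply scans whatever files exist in the two input directories used above.

```shell
# Hypothetical helper (name made up for this post): print the key and value
# class names stored in the header of a Hadoop SequenceFile.
# Assumption: both class names start with "org.apache."; others are not matched.
seqfile_classes() {
    head -c 300 "$1" \
        | LC_ALL=C tr -c '[:print:]' '\n' \
        | grep -o 'org\.apache\.[A-Za-z0-9_.$]*' \
        | head -n 2
}

# Scan every regular file in the two directories passed to cvb0_local above.
# Directories that do not exist are skipped silently.
for f in "${WORK_DIR:-.}"/reuters-sparse-vectors-cvb/* \
         "${WORK_DIR:-.}"/reuters-out-termvecs/tf-vectors/*; do
    [ -f "$f" ] || continue
    echo "== $f"
    seqfile_classes "$f"
done
```

The first line printed for each file is the key class and the second is the value class. If any file in an input directory reports org.apache.hadoop.io.Text as its value class, that would at least be consistent with the cast error above.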

Does anyone know what I'm doing wrong, or is cvb just broken when running without Hadoop?