vector generation


vector generation

Patterson, Josh
While reading through the wiki and article material on Mahout, I noticed
that there is a pre-generation step where vectors are generated from
either text with Lucene or ARFF with
org.apache.mahout.utils.vectors.arff.Driver. Looking at the k-means
driver and mapper (KMeansMapper.java), I noticed that the mapper takes a
key and then a Vector (point) as input.

 

Would it be smart or practical to make a special record reader for your
file format that reads your data in as vectors directly and emits them
to the mapper, in order to skip the pre-generation step? Just curious;
maybe I'm missing something there, or vectorization would be cumbersome
at that stage, etc.

 

Also, in Grant's article on Mahout he includes a vectorized 2.5 GB file
from Wikipedia, built via Lucene in the correct format to work with a
Mahout clustering algorithm. Is there a smaller (sub-100 MB) version of
this that I could play around with? I'm working with the basic building
blocks right now and figuring out the facets of vectorization with
respect to Mahout, so we can learn the base case (Lucene vectors) and
then move on to our specific case (sensor time series data).

 

Josh Patterson

TVA


Re: vector generation

Grant Ingersoll-2

On Nov 24, 2009, at 10:32 AM, Patterson, Josh wrote:

> While reading through the wiki and article material on Mahout, I noticed
> that there is a pre-generation step where vectors are generated from
> either text with Lucene or ARFF with
> org.apache.mahout.utils.vectors.arff.Driver. Looking at the k-means
> driver and mapper (KMeansMapper.java), I noticed that the mapper takes a
> key and then a Vector (point) as input.
>
>
>
> Would it be smart or practical to make a special record reader for your
> file format that reads your data in as vectors directly and emits them
> to the mapper, in order to skip the pre-generation step? Just curious;
> maybe I'm missing something there, or vectorization would be cumbersome
> at that stage, etc.

Probably would be useful. No one has taken that step yet.
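A sketch of what such a record reader might look like, using the old mapred API and the Mahout 0.2 class layout (untested; the CSV parsing and the fixed cardinality are placeholders for whatever your sensor format actually needs):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.mahout.matrix.DenseVector;
import org.apache.mahout.matrix.Vector;

/** Reads CSV lines and hands each one to the mapper as a Vector directly. */
public class CsvVectorRecordReader implements RecordReader<LongWritable, Vector> {

  private static final int CARDINALITY = 10; // placeholder; a real reader would take this from the JobConf

  private final LineRecordReader lineReader;
  private final Text line = new Text();

  public CsvVectorRecordReader(JobConf job, FileSplit split) throws IOException {
    lineReader = new LineRecordReader(job, split);
  }

  public boolean next(LongWritable key, Vector value) throws IOException {
    if (!lineReader.next(key, line)) {
      return false; // end of split
    }
    // Parse the line straight into the reusable Vector instead of emitting Text.
    String[] cols = line.toString().split(",");
    for (int i = 0; i < cols.length; i++) {
      value.set(i, Double.parseDouble(cols[i].trim()));
    }
    return true;
  }

  public LongWritable createKey() { return lineReader.createKey(); }

  public Vector createValue() { return new DenseVector(CARDINALITY); }

  public long getPos() throws IOException { return lineReader.getPos(); }

  public float getProgress() throws IOException { return lineReader.getProgress(); }

  public void close() throws IOException { lineReader.close(); }
}

You would pair it with a trivial FileInputFormat subclass whose getRecordReader returns this reader, then set that input format on the k-means job.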

>
>
>
> Also, in Grant's article on Mahout he includes a vectorized 2.5 GB file
> from Wikipedia, built via Lucene in the correct format to work with a
> Mahout clustering algorithm. Is there a smaller (sub-100 MB) version of
> this that I could play around with? I'm working with the basic building
> blocks right now and figuring out the facets of vectorization with
> respect to Mahout, so we can learn the base case (Lucene vectors) and
> then move on to our specific case (sensor time series data).

Here's what I did:

1. Using Solr, create an index; make sure you turn on term vectors for the appropriate fields.
2. Point the Lucene Driver at the index and create the vectors.

You could even do this using the Solr tutorial (solr/example), which would give you an index of about 20 docs.
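The vector-creation step is Mahout's Lucene driver; the invocation looked roughly like this (from memory of the 0.2-era utils, so check the driver's usage output for the current option names; the paths are placeholders):

    java -cp <mahout-utils-plus-dependencies> org.apache.mahout.utils.vectors.lucene.Driver \
        --dir solr/data/index \
        --field body \
        --idField docid \
        --dictOut /tmp/dictionary.txt \
        --output /tmp/wikipedia-vectors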



Here's the schema.xml I used (or at least the relevant field definitions):

        <field name="docid" type="string" indexed="true" stored="true" required="true"/>
        <field name="file" type="string" indexed="true" stored="true"/>

        <field name="doctitle" type="text" indexed="true" stored="true" multiValued="true" termVectors="true"/>
        <field name="body" type="text" indexed="true" stored="true" multiValued="true" termVectors="true"/>
        <field name="docdate" type="date" indexed="true" stored="true" multiValued="false"/>

        <field name="titleBody" type="text" indexed="true" stored="false" multiValued="true" termVectors="true"/>

        <field name="spell" type="text_spell" indexed="true" stored="false" multiValued="true"/>
        <!-- The default is used to create a "timestamp" field indicating
             when each document was indexed. -->
        <field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>

I also used the EnwikiDocMaker from Lucene's contrib/benchmark, plus a simple SolrJ wrapper, to feed the Wikipedia dump into Solr.
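A rough sketch of that wrapper is below. This is a reconstruction rather than the exact code: the benchmark DocMaker API (setConfig/makeDocument and the end-of-input NoMoreDataException) moved around between Lucene releases, so check the calls against your version. The field names match the schema above, and the Solr URL is the tutorial default.

import java.util.Properties;

import org.apache.lucene.benchmark.byTask.feeds.EnwikiDocMaker;
import org.apache.lucene.benchmark.byTask.feeds.NoMoreDataException;
import org.apache.lucene.benchmark.byTask.utils.Config;
import org.apache.lucene.document.Document;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class EnwikiSolrFeeder {
  public static void main(String[] args) throws Exception {
    // Point the benchmark doc maker at the enwiki XML dump
    // ("docs.file" is the benchmark's property name for the input file).
    Properties props = new Properties();
    props.setProperty("docs.file", args[0]);
    EnwikiDocMaker docMaker = new EnwikiDocMaker();
    docMaker.setConfig(new Config(props));

    SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

    try {
      while (true) {
        Document doc = docMaker.makeDocument();
        SolrInputDocument sdoc = new SolrInputDocument();
        // The benchmark doc makers use the same field names as the schema above.
        sdoc.addField("docid", doc.get("docid"));
        sdoc.addField("doctitle", doc.get("doctitle"));
        sdoc.addField("docdate", doc.get("docdate"));
        sdoc.addField("body", doc.get("body"));
        solr.add(sdoc);
      }
    } catch (NoMoreDataException e) {
      // Dump exhausted; fall through to commit.
    }
    solr.commit();
  }
}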