plsi in pig

prasenjit mukherjee-2
Hi,
I have implemented Hofmann's PLSI/EM algorithm in Pig, which I would like
to contribute back to the community for further scrutiny and
improvement. Let me know whether Mahout is the appropriate forum or
whether it should go to the Pig project.

I haven't seen any non-Java contributions to Mahout yet, which raises the
question: is Mahout Java-only?

-Thanks,
Prasen

Re: plsi in pig

Grant Ingersoll-2
Hmm, hadn't really thought about it, but I see no reason why we
wouldn't accept it and add it. I think our source tree can definitely
handle it.

I'd propose it go somewhere under:
trunk/core/src/main/pig/plsi

I'm not familiar with Pig, but I can learn, and I know others are. Is
it a single file?

See http://cwiki.apache.org/MAHOUT/howtocontribute.html for
instructions on contributing. Basically, just attach the file(s) to a
JIRA issue.

On Feb 11, 2009, at 2:18 AM, prasenjit mukherjee wrote:

> Hi,
>   I have implemented hofmann's plsi/em algo in pig which I would like
> to contribute back to the community for further
> scrutinization/improvement.  Let me know if mahout is the appropriate
> forum or should  it go to  pig project.
>
> Haven't  seen any non-java contributions to Mahout yet, which begs the
> question is Mahout only java based ?
>
> -Thanks,
> Prasen

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: plsi in pig

Sean Owen
It needs to go somewhere like trunk/core/src/pig/main, right, versus /java/?

I also see no harm in adding it, other than that it would remain
pretty isolated, right? It isn't part of the build, can't be integrated
with the other code, etc. Does it add value to package it with the
project, then?

Perhaps I misunderstand what Pig can do or how it can relate to Java?

On Wed, Feb 11, 2009 at 11:13 AM, Grant Ingersoll <[hidden email]> wrote:

> Hmm, hadn't really thought about it, but I see no reason why we wouldn't
> accept it and add it.  I think our source tree can definitely handle it.
>
> I'd propose it go somewhere under:
> trunk/core/src/main/pig/plsi
>
> I'm not familiar with Pig, but I can learn, and I know others are, is it a
> single file?
>
> See http://cwiki.apache.org/MAHOUT/howtocontribute.html for instructions on
> contributing.  Basically, just attach the file(s) to a JIRA issue.

Re: plsi in pig

prasenjit mukherjee-2
Pig is a higher-level language on top of Hadoop (much like Sawzall for
Google's MapReduce) that makes Hadoop easier to use.

It has SQL-like syntax and can break a script into separate
MapReduce jobs and chain them. From an execution point of view, a Pig
script is as simple as running a shell script with very few
operators/commands.

Some of its commands are join, group, cogroup, load, etc.

For example, the following Pig script takes a log file in the format
<txid>,<txt>,<user> and outputs a user-term-frequency file in the
format <user>\t<term>\t<cnt>:

raw = LOAD 'tx_log.csv' USING PigStorage(',') AS
    (transactionid:chararray, txt:chararray, user:chararray);
-- one row per (user, token) pair
tokenized = FOREACH raw GENERATE user, FLATTEN(TOKENIZE(txt)) AS attribute;
-- collect identical (user, token) pairs into one group
grouped = GROUP tokenized BY (user, attribute);
-- count the rows in each group
user_term_freq = FOREACH grouped GENERATE FLATTEN(group), COUNT(tokenized);
STORE user_term_freq INTO 'user_term_freq.txt';

At runtime, Pig takes the input and breaks the work into several map and
reduce tasks. It picks up hadoop-site.xml from its classpath.
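
For readers who don't know Pig, the computation in the script above can be sketched in plain Python on a single machine. This is only an illustration of what the script computes, not of how Pig executes it: the function name user_term_freq and the sample rows are made up for this sketch, and whitespace splitting is a rough stand-in for Pig's TOKENIZE, which also splits on a few other delimiters.

```python
from collections import Counter


def user_term_freq(lines):
    """Count how often each (user, token) pair occurs.

    Each input line is assumed to look like: <txid>,<txt>,<user>
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Naive comma split, similar in spirit to PigStorage(',').
        txid, txt, user = line.split(",")
        # Rough stand-in for FLATTEN(TOKENIZE(txt)): one entry per token.
        for token in txt.split():
            counts[(user, token)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample input, not taken from the thread.
    sample = [
        "1,red apple red,alice\n",
        "2,green apple,bob\n",
        "3,red pear,alice\n",
    ]
    for (user, token), cnt in sorted(user_term_freq(sample).items()):
        # Columns in the order the grouping key implies: user, term, count.
        print(f"{user}\t{token}\t{cnt}")
```

The Counter keyed on (user, token) plays the role of the GROUP ... BY (user, attribute) step followed by COUNT; on a cluster, Pig would spread the same grouping across map and reduce tasks instead of holding it in one process.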

-Prasen

On Wed, Feb 11, 2009 at 4:54 PM, Sean Owen <[hidden email]> wrote:

> Needs to go somewhere like trunk/core/src/pig/main right, versus /java/ ?
>
> I also see no harm in adding it, other than that it would remain
> pretty isolated right? isn't part of the build, can't be integrated
> with the other code, etc.? Does it add value to package it with the
> project then?
>
> Perhaps I misunderstand what Pig can do or how it can relate to Java?

Re: plsi in pig

Sean Owen
Yeah, I had the same high-level impression of what Pig is. If Pig is
indeed to (Java-based) Hadoop as Sawzall (szl) is to MapReduce, and the
project is in Java with heavy use of Hadoop, then yes, I think there's a
good argument to keep it all together, perhaps just putting the Pig
scripts into a separate source root.

On Wed, Feb 11, 2009 at 11:40 AM, prasenjit mukherjee
<[hidden email]> wrote:
> Pig is a higher level language ( more like Sawzall for Google's
> mapreduce )  on top of hadoop which makes hadoop easy to use.
>
> It has SQL like syntaxes and can break the command into separate
> mapreduce tasks and also chain them. From execution point of view they
> are as simple as running a shell script with very few
> operators/commands.

Re: plsi in pig

Grant Ingersoll-2
This is excellent, Prasen.

I see no reason not to include them. We are about ML first,
distributed/scalable ML second, and Hadoop-based third, IMO. Java
would be a distant fourth in my mind. In other words, I don't feel
particularly strongly about us being Java-only or even Hadoop-only. To
me, there is a significant need for community-developed machine
learning capabilities with a commercially friendly license. Add in the
ability to scale and run efficiently and you have a home run. In fact,
those are the very reasons we founded Mahout.


On Feb 11, 2009, at 6:40 AM, prasenjit mukherjee wrote:

> Pig is a higher-level language on top of Hadoop (much like Sawzall for
> Google's MapReduce) that makes Hadoop easier to use.
>
> It has SQL-like syntax and can break a script into separate
> MapReduce jobs and chain them. From an execution point of view, a Pig
> script is as simple as running a shell script with very few
> operators/commands.
>
> Some of its commands are join, group, cogroup, load, etc.
>
> For example, the following Pig script takes a log file in the format
> <txid>,<txt>,<user> and outputs a user-term-frequency file in the
> format <user>\t<term>\t<cnt>:
>
> raw = LOAD 'tx_log.csv' USING PigStorage(',') AS
>     (transactionid:chararray, txt:chararray, user:chararray);
> tokenized = FOREACH raw GENERATE user, FLATTEN(TOKENIZE(txt)) AS attribute;
> grouped = GROUP tokenized BY (user, attribute);
> user_term_freq = FOREACH grouped GENERATE FLATTEN(group), COUNT(tokenized);
> STORE user_term_freq INTO 'user_term_freq.txt';
>
> At runtime, Pig takes the input and breaks the work into several map and
> reduce tasks. It picks up hadoop-site.xml from its classpath.
>
> -Prasen

Re: plsi in pig

prasenjit mukherjee-2
So I created a JIRA issue,
https://issues.apache.org/jira/browse/MAHOUT-106, and also submitted a
patch along with README instructions. Please feel free to try it out with
different input samples. The default behaviour is to run Pig in local
mode. I'd appreciate any suggestions/reviews.

-Prasen

On Wed, Feb 11, 2009 at 5:32 PM, Grant Ingersoll <[hidden email]> wrote:

> This is excellent, Prasen.
>
> I see no reason not to include them.  We are about ML first,
> distributed/scalable ML second and Hadoop-based third, IMO.  Java would be a
> distant fourth in my mind.  In other words, I don't feel particularly strong
> about us being Java only or even Hadoop only.  To me there is a significant
> need for community-developed machine learning capabilities with a commercial
> friendly license.  Add in the ability to scale/run efficiently and you have
> a home run.  In fact, those are the very reasons we founded Mahout.

Re: plsi in pig

Grant Ingersoll-2
Cool, will look at this after the release.



On Feb 11, 2009, at 10:09, prasenjit mukherjee <[hidden email]> wrote:

> So I created a jira-issue :
> https://issues.apache.org/jira/browse/MAHOUT-106 and also submitted a
> patch along with readme instructions. Please feel free to try out with
> different input samples. The default behaviour is to run pig in local
> mode. Appreciate any suggestions/reviews.
>
> -Prasen